
Iterative Modulo Scheduling
Document information
Author | B. Ramakrishna Rau |
Major | Compiler and Architecture Research |
Company | Hewlett Packard |
Document type | Research Report |
Language | English |
Size | 2.96 MB |
Summary
I. Modulo Scheduling: Optimizing Loop Performance
This document explores modulo scheduling, a software pipelining technique that significantly improves the performance of innermost loops. Unlike traditional methods such as unroll-before-scheduling, which suffer from code size explosion, modulo scheduling constructs a repeating schedule (the kernel) for a single loop iteration and initiates successive iterations at a fixed initiation interval (II). This minimizes code expansion while maximizing performance. Since the II determines the steady-state execution rate, the goal is the smallest feasible II; this is bounded from below by the minimum initiation interval (MII), which is the larger of the resource-constrained bound (ResMII) and the recurrence-constrained bound (RecMII).
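Expressed symbolically (a standard formulation of these bounds from the modulo scheduling literature; the notation is not a verbatim quote of the report):

```latex
\mathrm{MII} = \max(\mathrm{ResMII},\, \mathrm{RecMII}),
\qquad
\mathrm{RecMII} = \max_{c \,\in\, \mathcal{C}}
  \left\lceil \frac{\sum_{e \in c} \mathrm{delay}(e)}
                   {\sum_{e \in c} \mathrm{distance}(e)} \right\rceil
```

where $\mathcal{C}$ is the set of elementary cycles in the dependence graph, delay is the latency along each dependence edge, and distance is the number of iterations the dependence spans.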
1. Introduction to Modulo Scheduling and Existing Techniques
The document begins by introducing cyclic scheduling schemes, which enhance instruction-level parallelism (ILP) by deliberately moving operations across iteration boundaries. It contrasts this with unroll-before-scheduling, which unrolls the loop some number of times and applies a global acyclic scheduling algorithm to the result. While this achieves some overlap between iterations, the back-edge remains a scheduling barrier, degrading performance; more unrolling reduces the degradation, but at the cost of larger code and greater scheduling complexity. A further difficulty with ad hoc cross-iteration code motion is that it is unclear which operations to move, in which direction, and how many times; excessive code motion can increase register pressure and reduce performance. The introduction notes that unroll-while-scheduling schemes are often credited with an advantage in handling loops with control flow, and then presents modulo scheduling as a 'schedule-then-move' alternative: a framework and set of constraints within which constructing a legal schedule implicitly determines all of the required code motions.
2. Modulo Scheduling Core Principles and the Initiation Interval
This section delves into the fundamental principles of modulo scheduling. The goal is to construct a schedule for one loop iteration that can be initiated repeatedly at a fixed interval (the II) without violating intra- or inter-iteration dependences or causing resource conflicts. Unlike unrolling techniques, modulo scheduling incurs limited code expansion, and with appropriate hardware support may require none at all. The modulo schedule, once constructed, determines all implied code motions and the resulting code structure, including branch placement. The process involves several steps, the central one being the modulo scheduling itself. The section discusses how the absence of rotating registers leads to kernel unrolling for modulo variable expansion, and how prologue and epilogue code sequences are generated for various loop types (DO-loops, WHILE-loops, loops with early exits), with and without predicated execution and rotating registers. It also compares iterative modulo scheduling against the Petri-net-based unroll-while-schedule approach, emphasizes the practical orientation of the algorithm described in the report, notes its ability to handle realistic machine models, and describes its evaluation with respect to schedule quality and computational complexity. The benefits of software pipelining are noted to depend heavily on workload and context.
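The central scheduling constraint can be made concrete with a modulo reservation table (MRT): an operation placed at time t occupies its resources at slot t mod II, so two operations conflict on a resource exactly when their usage times are congruent modulo II. The sketch below is illustrative only; the class name and the single-cycle, single-resource usage model are assumptions, not the report's implementation.

```python
# Minimal modulo reservation table (MRT) sketch.
# Assumes each operation uses one resource for one cycle at its issue time;
# real reservation tables (as in the report) can span multiple cycles/resources.

class ModuloReservationTable:
    def __init__(self, ii, resources):
        self.ii = ii
        # One row per resource, II columns: slots[r][t % II] holds the op using it.
        self.slots = {r: [None] * ii for r in resources}

    def conflicts(self, resource, time):
        return self.slots[resource][time % self.ii] is not None

    def reserve(self, resource, time, op):
        if self.conflicts(resource, time):
            raise ValueError(f"{op} collides on {resource} at slot {time % self.ii}")
        self.slots[resource][time % self.ii] = op

    def release(self, resource, time):
        self.slots[resource][time % self.ii] = None

# Example: with II = 3, ops at times 1 and 4 on the same adder collide (1 ≡ 4 mod 3).
mrt = ModuloReservationTable(ii=3, resources=["adder"])
mrt.reserve("adder", 1, "op_a")
print(mrt.conflicts("adder", 4))  # True: time 4 maps to the same slot as time 1
```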
3. Minimum Initiation Interval (MII) and Resource Usage
The section focuses on determining the minimum initiation interval (MII), a lower bound on the smallest II for which a valid modulo schedule can exist; a smaller II means faster execution. The MII calculation considers both critical resources and dependence chains within and across loop iterations. The resource-constrained bound (ResMII) is derived from the loop body's resource usage requirements. Rotating register files are mentioned as hardware support for efficient modulo scheduling, along with modulo variable expansion (virtual register renaming) for handling anti- and output dependences. The section then turns to modeling and tracking resource usage. Each operation's resource usage is defined by a reservation table, which specifies when each resource is used relative to the operation's issue time. Collision and cross-collision vectors between reservation tables are explained, and the use of synthetic reservation tables to simplify the representation is discussed. Examples illustrate how reservation tables and recurrence dependences can interact to make scheduling infeasible at the MII. Computing ResMII involves sorting operations by their degrees of freedom and iteratively selecting among alternatives to minimize resource usage; because this computation is O(N³), strategies are used to minimize the number of times it is invoked.
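For operations with a single alternative and simple reservation tables, ResMII reduces to a direct bound: for each resource, divide total usage by the number of units and round up, then take the maximum over resources. The sketch below shows only this simple case; handling alternatives sorted by degrees of freedom, as the report describes, requires the fuller iterative selection procedure.

```python
import math
from collections import Counter

def res_mii(ops_resource_usage, resource_counts):
    """Lower bound on II from resource usage alone (simple case:
    each op uses each listed resource for one cycle, no alternatives)."""
    demand = Counter()
    for usage in ops_resource_usage:      # usage: list of resources an op occupies
        demand.update(usage)
    return max(
        math.ceil(demand[r] / resource_counts[r])
        for r in resource_counts
    )

# Example: 7 ops need the single multiplier, 9 ops need one of 2 ALUs.
ops = [["mul"]] * 7 + [["alu"]] * 9
print(res_mii(ops, {"mul": 1, "alu": 2}))  # max(ceil(7/1), ceil(9/2)) = 7
```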
II. Determining the Initiation Interval and Schedule
Finding a good schedule is an iterative process. The document describes an iterative modulo scheduling algorithm that starts with a candidate II (typically the MII) and repeatedly attempts to construct a schedule, increasing the II whenever no feasible schedule can be found at the current value. The process manages resource usage with reservation tables to avoid conflicts and carefully handles dependences both within and between iterations, balancing the desire for a small II against the constraints imposed by the processor's resources and the loop's data dependences.
1. Iterative Modulo Scheduling Algorithm
A core contribution of the document is its iterative modulo scheduling algorithm, which determines both the II and the schedule. The process starts with a candidate II (initialized to the minimum initiation interval, MII) and repeatedly attempts to construct a schedule; if no feasible schedule is found at the current II, the II is increased and the attempt repeated. Within each attempt, operations are scheduled and, when conflicts arise, unscheduled and rescheduled, continuing until a 'fixed-point' solution satisfying all scheduling constraints is reached. The algorithm incorporates backtracking: when a dead end is reached, it revisits previous scheduling decisions and explores alternatives. This allows it to handle the combined complexities of resource allocation and data dependences within the loop, and its success depends heavily on the heuristics used to guide the search and manage the backtracking.
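A schematic of the driver loop, in the spirit of the report's algorithm (the names, the budget computation, and the inner scheduler are placeholders; in particular, scaling the per-II search effort by the number of operations via `BUDGET_RATIO` is an assumption about one reasonable parameterization):

```python
# Schematic driver for iterative modulo scheduling (illustrative only).
# try_to_schedule would place operations one at a time, unscheduling
# (backtracking over) conflicting operations, until the effort budget runs out.

BUDGET_RATIO = 3  # search effort per operation at each candidate II (assumed value)

def iterative_modulo_schedule(ops, mii, max_ii, try_to_schedule):
    for ii in range(mii, max_ii + 1):
        budget = BUDGET_RATIO * len(ops)   # cap on scheduling steps at this II
        schedule = try_to_schedule(ops, ii, budget)
        if schedule is not None:           # feasible schedule found at this II
            return ii, schedule
    return None                            # no schedule found up to max_ii
```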
2. The Role of the Initiation Interval (II)
The initiation interval (II) is central to modulo scheduling: it is the fixed interval between the starts of successive loop iterations, so a smaller II translates directly into faster execution. The primary objective is therefore to find the smallest II that admits a legal schedule. The algorithm begins with a candidate II, typically the MII, and increases it until a valid schedule is found. The choice of II strongly affects scheduling difficulty: a larger II creates later deadlines for operations on recurrences and leaves more available time slots, and it also provides more slots on the modulo reservation table, making non-overlapping resource placement easier, at the cost of slower execution. The document discusses how the II and the schedule are determined through this iterative process, and the implications of the II for generating the kernel, prologue, and epilogue code. The strategy of initiating iterations at a fixed II is compared with alternatives such as limiting the span or using pacemaker functions, highlighting the flexibility the proposed algorithm allows in the number of simultaneously executing iterations.
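The effect of II on recurrence deadlines can be seen in the standard earliest/latest start-time bounds used in modulo scheduling (common notation from the literature, not a verbatim quote of the report): the II-weighted distance term loosens both bounds as II grows.

```latex
\mathrm{Estart}(v) \;=\; \max_{(u,v)\in E}
  \bigl( t(u) + \mathrm{delay}(u,v) - II \cdot \mathrm{distance}(u,v) \bigr)
\qquad
\mathrm{Lstart}(v) \;=\; \min_{(v,w)\in E}
  \bigl( t(w) - \mathrm{delay}(v,w) + II \cdot \mathrm{distance}(v,w) \bigr)
```

Here t(u) is the scheduled time of operation u, and the max/min range over v's scheduled predecessors and successors, respectively.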
3. Kernel, Prologue, and Epilogue Generation and Scheduling
Once a suitable II and schedule are determined, the algorithm generates the kernel, prologue, and epilogue code. The kernel is the repeating portion of the schedule, compactly representing the steady state across many iterations; the prologue and epilogue handle the loop's ramp-up and ramp-down, ensuring smooth transitions into and out of the kernel. The loop-closing branch is scheduled subject to a constraint that depends on the II and the branch latency, and its target is always the first instruction of the kernel; if the II is smaller than the branch latency, the kernel must be unrolled to accommodate proper branch placement. The algorithm handles various loop types, including DO-loops, WHILE-loops, and loops with early exits, generating customized prologue and epilogue sequences. It may either adapt the kernel's schedule to the prologue and epilogue or schedule them separately while honoring the kernel's constraints. The result is a relatively small amount of code that approximates a completely unrolled and scheduled loop, achieving significant efficiency gains.
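When rotating registers are absent, the degree of kernel unrolling needed for modulo variable expansion is governed by how many IIs a value stays live. A common rule, stated here as an assumption about one standard formulation (modulo variable expansion in the style of Lam) rather than the report's exact procedure, unrolls by the maximum of ceil(lifetime/II) over loop-variant values; the branch-latency constraint above may force further unrolling.

```python
import math

def kernel_unroll_factor(lifetimes, ii):
    """Minimum kernel unroll so each value gets a distinct register per
    live copy (modulo variable expansion, no rotating registers).
    lifetimes: live-range length in cycles for each loop-variant value."""
    return max(math.ceil(l / ii) for l in lifetimes)

# Example: with II = 2, a value live for 5 cycles spans 3 concurrent
# iterations, so the kernel must be unrolled 3 times.
print(kernel_unroll_factor([5, 3, 2], ii=2))  # 3
```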
4. Search Strategies and Heuristics
The document explores different search strategies for finding a valid modulo schedule, distinguishing deterministic, goal-directed searches from semi-random approaches such as simulated annealing and genetic algorithms. Semi-random methods avoid local minima but are too computationally expensive for production compilers; goal-directed searches guided by heuristics are more efficient but risk getting trapped in local optima. The proposed algorithm combines heuristics with backtracking, iteratively refining the schedule by unscheduling conflicting operations and rescheduling them in an attempt to reach a fixed-point solution. The document discusses secondary heuristics for escaping dead-end states, emphasizing the importance of good heuristics both to guide the search and to select the next state when a dead end is encountered. Backtracking, which revisits previous decisions in the search space, is discussed as a means of escaping dead ends, though the document points out how difficult it is to exploit information from failed attempts to improve subsequent choices.
III. Handling Recurrences and Complexities
The presence of recurrence dependences (cycles in the dependence graph) adds considerable complexity. The algorithm addresses this by identifying strongly connected components (SCCs) within the loop and treating them as atomic units. The algorithm prioritizes scheduling operations that impact the MII. Heuristics guide the search for a feasible schedule, navigating through a search space that can be vast. Backtracking mechanisms handle dead-end states, ensuring the algorithm doesn't get stuck in local optima. The algorithm aims to achieve a near-optimal aggregate execution time.
1. Handling Recurrence Dependences
A significant challenge in modulo scheduling arises from recurrence dependences—cycles in the dependence graph. These cycles represent dependencies between operations within the same iteration and across multiple iterations. The presence of recurrences complicates the scheduling process because it introduces circular dependencies, making it difficult to determine a consistent, repeatable schedule. The algorithm addresses this by first identifying strongly connected components (SCCs) within the dependence graph. SCCs represent sets of operations with circular dependencies. The algorithm then handles these SCCs, which contain all cycles, as a separate problem. Once an SCC is scheduled, it's treated as an atomic unit, simplifying the overall scheduling process. This modular approach allows the algorithm to deal with the intricate relationships within the recurrence cycles without losing the overall structure and flow of the schedule. This is crucial because the presence of cycles significantly alters the way dependencies are managed and resolved.
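Identifying the SCCs is a standard graph step; below is a compact sketch using Tarjan's algorithm (the adjacency-dict representation of the dependence graph is an assumption for illustration).

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph: {node: [successor, ...]} dependence graph.
    Returns a list of SCCs, each a list of nodes."""
    index, lowlink, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w); scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# Example: a two-operation recurrence (a -> b -> a) plus an independent op c.
print(strongly_connected_components({"a": ["b"], "b": ["a"], "c": ["a"]}))
```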
2. Strategies for Managing Resource Conflicts
Another significant challenge stems from resource conflicts: multiple operations may compete for the same processor resources (e.g., functional units, buses, registers). This section discusses modeling and tracking resource usage with reservation tables, which record when each resource is used by each operation. If scheduling an operation would cause a resource conflict, the algorithm resolves it, potentially by unscheduling another operation, which is where backtracking enters. Dependences on predecessors are always honored: the algorithm computes each operation's earliest start time (Estart) so that it is never scheduled before its scheduled predecessors' results are available. The document discusses the trade-off between using the immediate Estart and the more computationally expensive transitive Estart in the presence of recurrences and the algorithm's iterative nature. Dependences on successors, by contrast, are deliberately ignored when an operation is placed; any resulting violations are repaired later, when the affected successors are scheduled and their Estart values are recomputed. The section also explains what happens when every candidate time slot produces a resource conflict: a specific decision rule forces placement anyway, evicting the conflicting operations, which ensures forward progress and avoids infinite loops in which operations endlessly displace one another.
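A sketch of the slot-selection step in this style (a simplification of the report's approach; the function name, signature, and the exact forced-placement rule are assumptions): search for a conflict-free slot in the II consecutive slots starting at Estart, and if every slot conflicts, force a slot and let the caller evict the colliding operations.

```python
def find_time_slot(op, estart, ii, mrt_conflicts, prev_time=None):
    """Pick an issue time for op (simplified slot-selection step).
    mrt_conflicts(op, t) -> True if op's resource usage collides at t mod II.
    Searching more than II consecutive slots is pointless: the modulo
    reservation table repeats with period II."""
    for t in range(estart, estart + ii):
        if not mrt_conflicts(op, t):
            return t, False                 # conflict-free slot found
    # Every slot conflicts: force a slot anyway; the caller must unschedule
    # (evict) the operations it collides with, then continue. Moving past
    # the op's previous time prevents endless mutual displacement.
    forced = estart if prev_time is None else max(prev_time + 1, estart)
    return forced, True                     # True: eviction required
```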
3. Comparison with Other Approaches to Recurrence Handling
The document examines how other modulo scheduling algorithms handle recurrences, and the implications for overall schedule quality. Previous modulo schedulers focused on the difficulty of scheduling operations on recurrence circuits, commonly scheduling the SCCs first. A comparison is made with Lam's algorithm, which schedules each SCC separately, initially ignoring inter-iteration dependences; fixing each SCC's schedule in isolation can lead to a larger effective RecMII. A key difference is that Lam's algorithm lets the isolated schedules of the individual SCCs determine the II for the entire loop, whereas the algorithm in this report fixes the II before scheduling the individual SCCs. Lam's approach of packaging each scheduled SCC as a macro-operation with a complex reservation table also makes it harder to schedule at the minimum achievable II. The enhanced modulo scheduling algorithm of Warter et al., which combines aspects of other approaches, is also discussed; it permits two operations with mutually exclusive predicates to share the same resource at the same time, and its achieved II is reported to rarely exceed the MII. The document contrasts these approaches with iterative modulo scheduling, which shows better overall performance.
IV. Comparison with Other Scheduling Techniques
The document compares iterative modulo scheduling with other loop scheduling approaches, such as unroll-while-scheduling schemes. While methods employing extensive code unrolling might achieve better results in certain cases, they come at significantly higher computational cost. Iterative modulo scheduling offers a balance between schedule quality and computational efficiency, achieving near-optimal performance (aggregate execution time within 2.8% of the lower bound) while maintaining reasonable compilation times.
1. Iterative Modulo Scheduling vs. Unrolling Techniques
This section directly compares iterative modulo scheduling with techniques that unroll the loop body before or during scheduling. Unroll-before-scheduling unrolls the loop many times and applies a global acyclic scheduling algorithm to the unrolled code; this causes significant code expansion, increasing code size and compilation time, and its performance gains are limited by the scheduling barrier at the back-edge, which greater unrolling can mitigate but not remove. Unroll-while-scheduling techniques instead unroll concurrently with scheduling. The document notes that iterative modulo scheduling strikes a balance between schedule quality and computational efficiency: its average cost is equivalent to scheduling 2.06 operations per operation in the original loop, so it requires less computational effort than any unrolling scheme that replicates the loop body by more than 106% (i.e., more than 2.06 copies of each original operation). While other approaches may have the potential to surpass modulo scheduling, particularly in handling multiple control flow paths, the lack of adequate published performance characterization makes a direct comparison difficult. The document highlights that iterative modulo scheduling achieves aggregate execution times within 2.8% of the lower bound, exhibiting a clear efficiency advantage.
2. Modulo Scheduling Compared to Other Approaches
The discussion expands the comparison with established techniques, focusing on the algorithms of Lam and of Warter et al. and on their handling of recurrences and resource usage. Lam's algorithm uses the SCC structure but schedules each SCC separately, initially ignoring inter-iteration dependences, which can produce suboptimal results: fixing each SCC's schedule in isolation can raise the effective RecMII. The algorithm described in the document avoids this by determining the initiation interval (II) before scheduling the individual SCCs and by scheduling them in a coordinated fashion that respects inter-iteration dependences. It can also often schedule at the minimum achievable II, whereas Lam's algorithm is frequently forced to a higher II by the complex reservation tables of the macro-operations that represent its scheduled SCCs. The enhanced modulo scheduling algorithm of Warter et al., which employs predicated execution and handles unions of resource usages, is also discussed; its achieved II is reported to rarely exceed the MII. The document notes that there is little literature on modulo scheduling for realistic machine models, highlighting the unique contributions of the presented algorithm.
3. Integer Linear Programming and Other Optimization Methods
The document briefly examines alternative approaches for finding optimal modulo schedules, particularly integer linear programming (ILP). While ILP can, in principle, find schedules that are optimal in both II and register usage, the cost of solving integer linear programs is far too high for production compilers, so the method is set aside as impractical for real-world compiler applications. The document acknowledges that other researchers have used linear programming to find optimal schedules under the assumption of unbounded resources, but notes that this mainly yields estimates of quantities such as RecMII, earliest start times, and lower bounds on register requirements, rather than production-quality schedules. The comparison ultimately underscores the efficiency advantage of the iterative modulo scheduling algorithm presented: it generates near-optimal schedules with far lower computational overhead, as a real-world compiler implementation requires.
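For concreteness, a time-indexed 0-1 formulation of the scheduling subproblem at a fixed II looks roughly as follows (a textbook-style sketch, not the specific formulation of any cited work): let x_{i,t} = 1 if operation i issues at time t, and u_{i,r} = 1 if operation i uses resource r.

```latex
\sum_{t} x_{i,t} = 1 \quad \forall i,
\qquad
\sum_{t} t\,x_{j,t} - \sum_{t} t\,x_{i,t}
  \;\ge\; \mathrm{delay}(i,j) - II \cdot \mathrm{distance}(i,j)
  \quad \forall (i,j) \in E,
```
```latex
\sum_{i} \;\sum_{t \,\equiv\, s \,(\mathrm{mod}\ II)} u_{i,r}\, x_{i,t}
  \;\le\; \mathrm{count}(r)
  \quad \forall r,\; \forall s \in \{0,\dots,II-1\}.
```

The last constraint is the modulo resource constraint: total usage of each resource in each congruence class of slots cannot exceed the number of units, which is exactly what the modulo reservation table enforces in the heuristic algorithm.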
V. Experimental Results and Evaluation
The effectiveness of iterative modulo scheduling was evaluated using a comprehensive set of benchmarks from sources like the Perfect Club, Spec benchmarks, and Livermore Fortran Kernels (LFK). The evaluation focused on the resulting II and the overall execution time. The results show that the algorithm generates near-optimal schedules for a large majority of loops (96% optimal in II). The average cost of the algorithm is equivalent to scheduling roughly two operations per operation in the original loop. This efficiency is a key advantage over other techniques.
1. Experimental Setup and Benchmark Suites
The experimental evaluation of the iterative modulo scheduling algorithm involved a substantial set of benchmark loops. The input data for the scheduler was derived from the Perfect Club benchmark suite, Spec benchmarks, and Livermore Fortran Kernels (LFK), using the Fortran77 compiler for the Cydra 5 processor. The Cydra 5 compiler served as a basis for selecting candidate loops, rejecting those that were not DO-loops, had early exits, contained procedure calls, or exceeded 30 basic blocks before IF-conversion. For the selected loops, the intermediate representation (after load-store elimination, recurrence back-substitution, and IF-conversion) was used as input to the research scheduler. Profile statistics gathered for basic block execution frequencies were used to estimate the impact of scheduling heuristics on execution time. A total of 1327 loops were evaluated (1002 from Perfect Club, 298 from Spec, and 27 from LFK), with 597 loops actually executed during benchmark runs with the prescribed data input. This extensive benchmark set ensured a thorough evaluation across different program characteristics and loop structures. The analysis of the benchmark dataset revealed that the number of operations per loop was generally quite small, skewed toward smaller values with a long tail, which the document attributes to the presence of many initialization loops within the benchmarks.
2. Metrics for Evaluating Schedule Quality
The evaluation focused on key metrics of schedule quality. The primary metric was the initiation interval (II), since it is the main determinant of performance except at very short trip counts; the schedule length (SL) of one iteration was a secondary metric. The total execution time for a loop was computed as EntryFreq*SL + (LoopFreq-EntryFreq)*II, where EntryFreq is the number of times the loop is entered and LoopFreq is the number of times the loop body is executed, assuming no processor stalls. For loops with small trip counts the SL coefficient is comparable to that of II, but in most cases the II term dominates execution time. The ultimate measure of schedule quality was the comparison of actual execution time to a (not necessarily achievable) lower bound, computed from lower bounds on both SL and II. The results show that 54% of the loops achieved the lower-bound execution time and that no loop exceeded 1.5 times its lower bound; in aggregate, execution time was only 2.8% above the lower bound, demonstrating high schedule quality. The computational complexity of the algorithm was also considered, with the aim of staying close to the cost of acyclic list scheduling so that compilation time remains reasonable.
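As a quick worked illustration of the formula with made-up numbers (purely hypothetical, not taken from the report's data), consider a loop entered 10 times whose body executes 1000 times in total, with SL = 20 cycles and II = 4:

```latex
10 \times 20 + (1000 - 10) \times 4 = 200 + 3960 = 4160 \text{ cycles},
```

versus 1000 × 20 = 20000 cycles if every iteration took the full schedule length, which is why the II term dominates at moderate and large trip counts.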
3. Analysis of Results and Impact of Budget Ratio
The experimental results demonstrate that the iterative modulo scheduling algorithm achieves near-optimal schedules: the II was optimal for 96% of the loops in the benchmark suite, and the aggregate execution time across all loops was only 2.8% above the lower bound, highlighting its efficiency and effectiveness. The evaluation also explored the impact of the BudgetRatio parameter, which controls the search effort the scheduler may expend at each candidate II. A larger BudgetRatio helps find schedules at smaller II values, improving performance at the expense of compilation time; too small a BudgetRatio causes feasible II values to be abandoned prematurely, so successively larger II values are tried and slower schedules produced. Conversely, increasing the BudgetRatio beyond the point where the minimum achievable II is reached adds computation without any performance gain. This suggests an optimum BudgetRatio that balances code quality against compilation time. The reported average cost of the algorithm is compared with the computational complexity of other approaches, underscoring its advantage in efficiency and scalability.
VI. Related Work and Conclusion
The document reviews the existing literature on modulo scheduling, highlighting this work's contributions in handling realistic machine models and complex scenarios. It discusses how the algorithm addresses dependences on predecessors and successors, resource conflicts, and the computation of earliest start times (Estart). The conclusion emphasizes that iterative modulo scheduling is a practical and efficient approach to software pipelining, offering a compelling balance between schedule quality and computational complexity, and that it compares favorably with alternative methods in both efficiency and code expansion.
1. Review of Existing Modulo Scheduling Literature
The section begins by noting the scarcity of literature on modulo scheduling algorithms for finite, realistic machine models. Previous work often focused on the difficulty of scheduling operations on recurrence circuits under their deadline constraints, leading to schemes that schedule the strongly connected components (SCCs) first. The author contrasts this with his own earlier work at Cydrome, in both a prototype scheduler and a production compiler, where all SCCs were iteratively modulo scheduled together in a first phase, followed by a second phase incorporating the remaining operations; those heuristics are described as more computationally expensive and as often producing worse schedules. The review points to the limitations of existing methods, arguing that inattention to realistic machine complexities and resource constraints motivated the development of the present, more nuanced and practically useful algorithm.
2. Comparison with Other Modulo Scheduling Algorithms
The section continues by comparing the presented algorithm with other notable modulo scheduling approaches. Lam's algorithm uses the SCC structure but list-schedules each SCC independently, initially neglecting inter-iteration dependences; each scheduled SCC is then treated as a macro-operation, which can increase the RecMII and make it difficult to schedule at the minimum achievable II. The presented algorithm is argued to be superior because it schedules all SCCs in a coordinated fashion, leading to a more effective determination of the initiation interval (II). The algorithm of Warter et al. is also reviewed: it combines the best features of other methods and allows concurrent scheduling of operations with mutually exclusive predicates, but is argued to fall short of the presented algorithm's efficiency. The document also cites integer linear programming (ILP) as a theoretically optimal but computationally impractical approach to modulo scheduling, again highlighting the efficiency and practical applicability of the proposed iterative approach.
3. Conclusion and Summary of Contributions
The conclusion summarizes the key findings and contributions of the research. The iterative modulo scheduling algorithm generates near-optimal schedules, achieving the optimal II for 96% of the benchmark loops and an aggregate execution time only 2.8% above the lower bound. It does so efficiently, outperforming in cost any method that replicates the loop body by more than 106%. The algorithm manages resource constraints and data dependences within realistic machine models, and its iterative, backtracking nature proves highly economical in the computational resources it requires. Its applicability extends to a wide variety of loop structures (DO-loops, WHILE-loops, loops with early exits) and machine capabilities (with or without predicated execution and rotating registers). Overall, the report argues that iterative modulo scheduling is a robust, efficient, and practically valuable approach to software pipelining in optimizing compilers.