AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS

Tanya M. Lattner, B.S., University of Portland, 2000

THESIS

Swing Modulo Scheduling for Superblocks

Document information

Author: Tanya M. Lattner
Advisor: Vikram Adve
School: University of Illinois at Urbana-Champaign
Major: Computer Science
Document type: Thesis
Place: Urbana-Champaign
Language: English
Format: PDF
Size: 566.88 KB

Summary

I. Limitations of Traditional Modulo Scheduling and the Proposed Solution

This thesis addresses the limitations of existing modulo scheduling algorithms, such as Swing Modulo Scheduling (SMS), which only handle single basic block loops and miss opportunities for further Instruction Level Parallelism (ILP). The research proposes extending SMS to support superblocks—groups of basic blocks with a single entry and multiple exits—to improve loop optimization and enable scheduling of hot paths identified through profiling. This extension increases the number of loops amenable to modulo scheduling, enhancing performance.

1. The Fundamental Limitation of Traditional Modulo Scheduling

The core problem identified is that conventional Modulo Scheduling algorithms, including Swing Modulo Scheduling (SMS), are limited to optimizing single basic block loops. This restriction significantly hinders their ability to fully exploit Instruction Level Parallelism (ILP). By confining their operations to individual basic blocks, these algorithms overlook valuable opportunities for instruction reordering and parallel execution that exist across multiple basic blocks. The consequence is a suboptimal utilization of processing resources and a missed potential for performance enhancement. The document explicitly states that this limitation leads to "missing opportunities for further Instruction Level Parallelism by only handling single basic block loops." This underperformance highlights the urgent need for improved techniques capable of handling more complex loop structures.

2. Introducing Superblocks: An Enhanced Approach to Modulo Scheduling

To overcome the limitations of traditional Modulo Scheduling, the thesis proposes the use of superblocks. A superblock is defined as a collection of basic blocks characterized by a single entry point and multiple exit points. This structure allows for a more comprehensive analysis and optimization of loops that span multiple basic blocks. By extending Swing Modulo Scheduling (SMS) to accommodate superblocks, the algorithm can now be applied to a wider range of loops, thereby increasing its applicability and effectiveness. The advantage of utilizing superblocks is that it addresses the limitation of previous algorithms by enabling optimization across multiple basic blocks, thus unlocking additional potential for Instruction Level Parallelism (ILP) and performance improvements. The document highlights that this extension "increases the number of loops Swing Modulo Scheduling can be applied to." The adoption of superblocks is a crucial step towards more effective loop optimization.
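The defining single-entry, multiple-exit property can be made concrete with a small sketch. The following is an illustrative check only, with an invented CFG representation (successor lists in a dict); it is not the thesis's implementation.

```python
# Hypothetical sketch: checking the superblock property (single entry,
# multiple exits) on a toy CFG represented as successor lists.

def is_superblock(cfg, blocks, entry):
    """Return True if `blocks` forms a superblock: control may enter the
    region only at `entry`, though it may exit from any block in it."""
    region = set(blocks)
    for b, succs in cfg.items():
        if b in region:
            continue
        # An edge from outside the region may target only the entry block.
        for s in succs:
            if s in region and s != entry:
                return False  # side entrance -> not a superblock
    return True

cfg = {
    "preheader": ["A"],
    "A": ["B", "exit1"],   # A has a side exit out of the loop
    "B": ["A", "exit2"],   # back edge to A plus a second exit
    "exit1": [], "exit2": [],
}
print(is_superblock(cfg, ["A", "B"], "A"))  # True: only A is entered from outside
```

Note that multiple exits are permitted; only a second entry point disqualifies the region, which is what tail duplication during superblock formation is designed to eliminate.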

3. Optimizing Hot Paths and Runtime Considerations

Beyond the structural enhancement of using superblocks, the proposed method also focuses on identifying and optimizing hot paths within loops. Hot paths are the most frequently executed sequences of instructions within a loop; concentrating optimization effort on these critical sections maximizes the overall performance impact. Hot paths can be identified from profile information gathered either offline or at runtime. The thesis notes that a hot path is itself a single-entry, multiple-exit region, so Modulo Scheduling can be applied to it directly, with the supporting profiling performed offline or at runtime. The algorithm is therefore flexible and adaptable to the behavior of individual programs. Its low compile-time overhead makes applying Modulo Scheduling at runtime practical, allowing dynamic adjustments based on observed execution patterns.

4. Future Research Directions: Expanding Architecture Support and Optimization Strategies

The thesis concludes by outlining avenues for future research. One key area is the implementation of Swing Modulo Scheduling (SMS) on other architectures, specifically mentioning IA64 as a promising platform. IA64’s architecture, with its extensive register file, long latency instructions that don’t block other instructions, and support for speculative execution, would allow the extended SMS algorithm to leverage these features for enhanced ILP and better performance gains. A second area of future work focuses on utilizing the extended SMS algorithm for superblock loops in a runtime or profile-guided optimization context. The ideal scenario involves forming superblocks dynamically using feedback from program execution, thereby focusing optimization on the most frequently traversed paths. This dynamic approach, combined with the algorithm’s inherently low compile time, makes runtime optimization a practical and cost-effective possibility, maximizing the likelihood of performance improvements. This adaptive strategy aligns directly with the goal of enhancing the efficiency and effectiveness of Modulo Scheduling.

II. Swing Modulo Scheduling Algorithm Enhancements for Superblocks

The core contribution is an enhanced Swing Modulo Scheduling algorithm for superblocks. This involves adapting the original algorithm to handle multiple basic blocks while prioritizing the most frequently executed path (hot path) within the loop. The algorithm considers resource constraints, data dependencies, and aims to minimize the Minimum Initiation Interval (MII) to maximize ILP. The implementation targets the SPARC V9 architecture, but the approach is adaptable to other architectures, such as IA64, which offer features like a large register file and speculative execution that can further improve performance gains from modulo scheduling.

1. Extending Swing Modulo Scheduling to Handle Superblocks

The core enhancement detailed in this section is the extension of the Swing Modulo Scheduling (SMS) algorithm to accommodate superblocks. A superblock is defined as a control-flow structure consisting of multiple basic blocks with a single entry point and multiple exit points. This contrasts with the original SMS algorithm's limitation to single basic block loops. This extension is significant because it allows for the application of SMS to a broader range of loop structures, thereby improving its overall applicability and potentially boosting performance. The key benefit is the ability to optimize loops that extend beyond a single basic block. The document clearly states that extending SMS to support superblocks "increases the number of loops Swing Modulo Scheduling can be applied to." This expansion of capabilities is a central contribution of the research presented.

2. The Significance of Superblocks for Instruction Level Parallelism (ILP)

The adoption of superblocks directly addresses the previously identified limitation of SMS in fully exploiting Instruction Level Parallelism (ILP). By allowing the algorithm to operate across multiple basic blocks, it opens up new possibilities for instruction reordering and parallel execution. This is crucial because many real-world loops exhibit control flow complexities that extend beyond single basic blocks. The limitation of previous algorithms to single basic blocks caused missed opportunities for ILP, directly impacting performance. The use of superblocks, however, allows the algorithm to address this limitation by improving the potential for instruction reordering and increasing the opportunities for exploiting ILP. This enhanced capability is expected to translate into notable performance gains, especially in loops with intricate control flow and potential for parallel execution.

3. Handling Hot Paths within Superblocks

A further refinement involves the capability to handle hot paths within superblocks. Hot paths represent the most frequently executed sequences of instructions inside a loop, as determined through profiling. By focusing optimization efforts on these critical sections, overall performance improvement is significantly enhanced. The ability to handle hot paths is crucial because it allows for a more targeted and effective optimization strategy. The algorithm's ability to operate on superblocks, combined with its capacity to focus on the most frequently executed paths (hot paths), significantly increases its efficiency and the potential for performance gains. The combination of superblock handling and hot path optimization provides a powerful approach to enhance the effectiveness of Swing Modulo Scheduling. The document explicitly notes that optimizing hot paths "increases the probability of performance gains since the loop only contains the most frequently executed path."
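The hot-path idea can be sketched as a greedy trace grown from edge profile counts, always following the most frequently taken successor. This follows the classic superblock-formation heuristic; the data layout and function name below are assumptions for illustration, not the thesis's code.

```python
# Illustrative greedy hot-path (trace) selection from edge profile counts.

def select_hot_path(edge_counts, start, region):
    """Grow a trace from `start`, always following the most frequently
    executed successor edge that stays inside `region`."""
    path, visited = [start], {start}
    cur = start
    while True:
        succs = [(dst, n) for (src, dst), n in edge_counts.items()
                 if src == cur and dst in region and dst not in visited]
        if not succs:
            break
        nxt, _ = max(succs, key=lambda p: p[1])  # hottest outgoing edge
        path.append(nxt)
        visited.add(nxt)
        cur = nxt
    return path

# A diamond where A -> B is taken 90% of the time:
counts = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 85, ("C", "D"): 10}
print(select_hot_path(counts, "A", {"A", "B", "C", "D"}))  # ['A', 'B', 'D']
```

A production trace selector would also apply a minimum-bias threshold before committing to an edge, so that near 50/50 branches do not produce a misleading "hot" path.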

4. Architectural Considerations and Future Directions

The discussion also touches upon architectural considerations and future research directions. The thesis highlights that the proposed modifications to Swing Modulo Scheduling (SMS) are designed to be adaptable to various architectures. While the current implementation targets the SPARC V9 architecture, there is a stated desire to extend its applicability to other platforms, such as IA64. The suitability of IA64 is attributed to its abundant registers (reducing register spills), long-latency instructions that do not block other instructions, and support for speculative execution. These features would allow SMS to operate with greater freedom and potentially achieve even greater performance gains. The prospect of extending the algorithm to other architectures and incorporating profile-guided or runtime optimization demonstrates its flexibility and its potential for continued improvement and wider application.

III. Minimum Initiation Interval (MII) Calculation and Node Ordering

Effective modulo scheduling relies on calculating the MII, representing the minimum time between loop iterations. The algorithm computes MII considering both resource and recurrence constraints. A key aspect is node ordering within the Data Dependence Graph (DDG). The algorithm prioritizes instructions on critical paths while minimizing register pressure to reduce live ranges and avoid costly register spills. This sophisticated ordering strategy is crucial for achieving optimal schedules in terms of ILP and execution time.

1. Minimum Initiation Interval (MII) Calculation

The Minimum Initiation Interval (MII) is a crucial parameter in Modulo Scheduling, representing the minimum number of cycles between the initiation of successive loop iterations. The MII is determined by analyzing resource constraints and data dependencies within the Data Dependence Graph (DDG). Resource constraints arise when instructions compete for the same processing resources (e.g., functional units), forcing delays if they are scheduled too closely together. Data dependencies dictate the order in which instructions must execute due to data flow. The algorithm computes a resource-constrained bound (ResMII) and a recurrence-constrained bound (RecMII) and takes their maximum as the MII. The MII then serves as the starting value for the Initiation Interval (II) when generating a schedule, with the II increased only if no valid schedule can be found at the current value. This careful computation ensures that the generated schedule respects both resource availability and data dependencies, which is essential for the correctness and efficiency of the scheduling process.
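As a minimal sketch, assuming the standard formulation MII = max(ResMII, RecMII): ResMII divides per-iteration demand on each resource by the units available, and RecMII divides each recurrence's latency by its dependence distance. The instruction mix and resource counts below are invented for illustration.

```python
# A minimal sketch of the MII computation (MII = max(ResMII, RecMII)).
import math

def res_mii(op_resource_use, resource_counts):
    """ResMII: for each resource, uses required per iteration divided by
    the number of available units, rounded up; take the maximum."""
    usage = {}
    for res in op_resource_use:
        usage[res] = usage.get(res, 0) + 1
    return max(math.ceil(n / resource_counts[r]) for r, n in usage.items())

def rec_mii(cycles):
    """RecMII: for each recurrence, total latency around the cycle divided
    by the sum of its dependence distances, rounded up; take the maximum."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

# Five ops needing an ALU (2 units), two needing the FPU (1 unit):
ops = ["alu", "alu", "alu", "alu", "alu", "fpu", "fpu"]
res = {"alu": 2, "fpu": 1}
# One recurrence: 6 cycles of latency carried over a distance of 2 iterations.
mii = max(res_mii(ops, res), rec_mii([(6, 2)]))
print(mii)  # 3: ResMII = max(ceil(5/2), ceil(2/1)) = 3, RecMII = ceil(6/2) = 3
```

If no schedule exists at II = MII, the scheduler retries with II + 1, which is why achieving a final II equal to the MII (a ratio of 1.0 in the results section) indicates a theoretically optimal schedule.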

2. Node Ordering within the Data Dependence Graph (DDG)

Effective instruction scheduling necessitates a well-defined node ordering within the Data Dependence Graph (DDG). This ordering determines the priority with which instructions are processed during scheduling. The algorithm uses a sophisticated ordering that balances prioritizing instructions on critical paths against minimizing register pressure. Prioritizing critical-path instructions reduces the overall loop execution time, while minimizing register pressure prevents excessive register spills, which severely degrade performance by forcing memory accesses. The ordering considers node properties such as latency, dependence distances, and the criticality of the path a node belongs to. Heuristics such as minimizing 'lstart' (the latest cycle at which a node can start, derived from the latencies of its successors) and 'mobility' (the difference between a node's earliest and latest start times) guide the ordering. A good node ordering is critical for achieving schedules that balance speed and efficient register utilization.
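The earliest/latest-start bookkeeping behind those heuristics can be sketched as follows. This is a simplified, acyclic version: the graph is invented, recurrences are ignored, and the full SMS ordering (which alternates top-down and bottom-up sweeps to keep nodes near their neighbors) is reduced to a single mobility-based sort.

```python
# Hedged sketch: compute earliest/latest start times over an acyclic DDG
# and prioritize nodes with low mobility (critical-path nodes first).

def order_nodes(nodes, edges, latency):
    """nodes: topologically ordered list; edges: (src, dst) dependence
    pairs; latency: per-node latency in cycles."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for s, d in edges:
        succs[s].append(d)
        preds[d].append(s)
    # estart: as-soon-as-possible start time, forward sweep.
    estart = {}
    for n in nodes:
        estart[n] = max((estart[p] + latency[p] for p in preds[n]), default=0)
    length = max(estart[n] + latency[n] for n in nodes)  # critical path length
    # lstart: as-late-as-possible start time, backward sweep.
    lstart = {}
    for n in reversed(nodes):
        lstart[n] = min((lstart[s] - latency[n] for s in succs[n]),
                        default=length - latency[n])
    mobility = {n: lstart[n] - estart[n] for n in nodes}
    # Mobility 0 means the node lies on a critical path.
    return sorted(nodes, key=lambda n: (mobility[n], estart[n]))

nodes = ["load", "addr", "mul", "add", "store"]
edges = [("load", "mul"), ("mul", "add"), ("add", "store"), ("addr", "store")]
lat = {"load": 2, "addr": 1, "mul": 3, "add": 1, "store": 1}
print(order_nodes(nodes, edges, lat))  # ['load', 'mul', 'add', 'store', 'addr']
```

Here the `addr` computation has slack (mobility 5) and is deferred behind the critical `load → mul → add → store` chain, which is exactly the prioritization the paragraph describes.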

3. Handling Recurrences and Strongly Connected Components (SCCs)

The presence of cycles (recurrences) in the DDG complicates the node ordering process. Recurrences represent dependencies between instructions across different iterations of the loop. For graphs with strongly connected components (SCCs) containing numerous edges, computing recurrences can become computationally expensive. To address this, the algorithm uses a heuristic approach: If an SCC has an excessive number of edges, it is treated as a single recurrence. This approximation simplifies the computation while still preserving the correctness of the scheduling process, although it might lead to a slightly less optimal MII. The RecMII for the SCC is calculated by dividing the total latency of all nodes in the SCC by the sum of all dependence distances. This handling of recurrences and SCCs is essential for maintaining computational feasibility while ensuring the algorithm produces a valid and reasonably optimal schedule.
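The SCC approximation described above amounts to a single division, sketched here with invented node and edge data; real code would first extract the SCC and its loop-carried edges from the DDG.

```python
# The SCC-as-one-recurrence approximation: total latency of the SCC's
# nodes divided by the total dependence distance on its edges.
import math

def scc_rec_mii(node_latencies, edge_distances):
    """Approximate RecMII for an SCC: ceil(sum of node latencies /
    sum of loop-carried dependence distances)."""
    return math.ceil(sum(node_latencies) / sum(edge_distances))

# Three instructions (latencies 2, 3, 1) in an SCC whose edges carry a
# total dependence distance of 2 iterations:
print(scc_rec_mii([2, 3, 1], [0, 0, 2]))  # ceil(6 / 2) = 3
```

Because this lumps every cycle in the SCC together instead of enumerating each elementary circuit, it can overestimate the RecMII, which matches the text's note that the heuristic may yield a slightly less optimal MII while remaining safe.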

4. The Role of Node Attributes in Optimizing Schedules

Various node attributes are calculated from the DDG to guide the instruction scheduling process. These attributes help to prioritize instructions based on their criticality and to minimize register pressure. The attributes involved include the latency of an instruction, the distance of dependence edges, the set of predecessors and successors for each node, and the Minimum Initiation Interval (MII). The algorithm leverages these attributes to strategically order instructions for scheduling. This includes prioritizing instructions on the most critical paths and placing instructions close to their predecessors and successors to decrease live ranges and minimize register pressure. The specific calculation of these node properties ensures that the resulting schedule will be both efficient and accurate, maximizing the algorithm's effectiveness in reducing execution time and minimizing register usage.

IV. Experimental Results and Performance Analysis

Experiments using SPEC benchmarks demonstrate the effectiveness of the extended SMS algorithm. While some benchmarks showed significant performance gains (10-33%), others exhibited smaller improvements or even slight performance degradation. Factors influencing performance include architectural limitations (SPARC V9's lack of speculative execution), the presence of short loops offsetting gains from longer ones, and the potential for increased register pressure leading to register spills. Future work could explore heuristics, such as iteration count thresholds, to selectively apply modulo scheduling to appropriate loops.

1. Benchmark Selection and Experimental Setup

The experimental evaluation involved a diverse set of benchmarks sourced from various suites including SPECINT 2000, SPECFP 2000, SPECINT 95, SPECFP 95, PtrDist, FreeBench, Prolangs-C, and additional programs like fpgrowth, sgefa, make dparser, hexxagon, optimizer-eval, and bigfib. These benchmarks were chosen to encompass a range of computational characteristics and loop structures, providing a comprehensive testbed for the extended Swing Modulo Scheduling (SMS) algorithm. All benchmarks were compiled and run on a 1GHz Ultra SPARC IIIi processor system. Execution times were obtained by running each executable three times and using the minimum time (user + system) as the final result, ensuring reliability and minimizing the effect of random fluctuations. This rigorous approach ensures the validity and robustness of the experimental findings.

2. Performance Gains and Architectural Influences

The experimental results demonstrated varied performance improvements across different benchmarks. Some benchmarks showed substantial speedups (10-33%), notably 107.mgrid, bigfib, and make dparser. Others displayed more modest gains (1-9%), including 172.mgrid, 168.wupwise, and fpgrowth. The performance gains were influenced by several factors. Long-running loops, as observed in 107.mgrid (achieving 33% speedup), were shown to offset the overhead associated with prologue and epilogue code. The presence of floating-point instructions with long latencies (4-7 cycles) also contributed to ILP exploitation and speedups in some benchmarks. Conversely, the lack of performance improvements in other benchmarks highlights the limitations imposed by architectural constraints. The SPARC V9 architecture’s absence of speculative execution and non-trapping instructions limited the scope for instruction reordering and performance optimization, leading to inconsistent results. This points to the architectural dependence of performance gains from modulo scheduling.

3. Analysis of Register Pressure and Loop Characteristics

The experiments also shed light on the impact of register pressure and loop characteristics on the overall performance. Loops containing more than 100 instructions (Large loops) were more prone to register spills, significantly degrading performance. The increased register pressure was primarily attributed to the nature of Modulo Scheduling, which can substantially increase the number of live values in a program. However, SMS was observed to mitigate register pressure more effectively than other Modulo Scheduling algorithms. Furthermore, the loop's trip count (number of iterations) needing to be loop-invariant (not computed within the loop itself) presented another constraint. The algorithm's inability to handle non-invariant loop trip counts prevented modulo scheduling in some cases. The analysis of these factors—register pressure, loop size, and loop invariance—is crucial in understanding the performance variability across different benchmarks and for guiding future improvements to the algorithm.

4. Interpreting II Ratios and Identifying Areas for Improvement

The analysis includes evaluating the Initiation Interval (II) ratio—the ratio of actual II to theoretical II. A ratio of 1.0 indicates that the algorithm achieved the theoretically optimal schedule given the resource and recurrence constraints. While most benchmarks achieved a ratio of 1.0, others exhibited higher II ratios (25-55% higher for 172.mgrid, sgefa, and neural). These deviations highlight potential areas for algorithm improvement. Non-optimal node ordering and the approximations used in calculating RecMII (Recurrence Minimum Initiation Interval), particularly in cases with computationally expensive circuit finding algorithms, were identified as potential causes. These insights suggest avenues for enhancing the algorithm's ability to consistently achieve theoretically optimal schedules and further improve performance in such cases. The data-driven analysis of these II ratios points to the intricacies of instruction scheduling and the necessity for continued refinement of the algorithm.

V. Comparison with Other Modulo Scheduling Algorithms

The thesis briefly discusses other modulo scheduling algorithms, such as Iterative Modulo Scheduling (IMS) and Slack Modulo Scheduling, highlighting their similarities to and differences from the proposed extension of SMS. Global techniques such as Hierarchical Reduction and Enhanced Modulo Scheduling do handle multiple basic blocks, but they may be less effective on loops with highly skewed execution paths. SMS, especially with the superblock extension, shows advantages in balancing optimal schedule generation with low register pressure and efficient compile time.

1. Swing Modulo Scheduling (SMS) Compared to Iterative Modulo Scheduling (IMS) and Slack Modulo Scheduling

The document compares SMS to two other Modulo Scheduling algorithms: IMS (Iterative Modulo Scheduling) and Slack Modulo Scheduling. While all three aim to minimize the time between loop iterations, they differ in their approaches. IMS uses simple extensions to acyclic list scheduling and a height-based priority function, employing backtracking (unscheduling instructions) to resolve conflicts. Slack Modulo Scheduling is a bidirectional strategy that considers resource and recurrence constraints, register pressure, and critical paths, also using backtracking. In contrast, SMS avoids backtracking, scheduling instructions only once. A study cited in the document shows that SMS generally outperforms Slack, especially on architectures of low to medium complexity, while IMS performs better than Slack on higher complexity architectures. SMS is noted for its ability to find optimal schedules while keeping register pressure low and maintaining computational efficiency, a key advantage over other algorithms.

2. Addressing Limitations of Global Modulo Scheduling Techniques

The comparison also highlights the limitations of existing global Modulo Scheduling techniques, particularly Hierarchical Reduction and Enhanced Modulo Scheduling (EMS). Both techniques handle multiple basic blocks but consider resource and dependence constraints across all paths within a loop. In loops with skewed execution paths (where one path is significantly more frequent than others), this approach may lead to suboptimal schedules. This is where the extended SMS algorithm, focused on superblocks, offers a potential improvement by prioritizing the most frequently executed path (the 'hot' path). This targeted optimization avoids the inefficiencies that can arise when considering all paths equally in loops with significant execution path skewness. This contrast underscores the benefits of focusing optimization on the critical execution paths within the loop for improved performance.

3. SMS's Unique Characteristics and Advantages

The document emphasizes the characteristics of SMS that underlie its strong performance. Unlike IMS and Slack Modulo Scheduling, SMS avoids backtracking, scheduling each instruction exactly once; this streamlined approach simplifies the algorithm and contributes to its efficiency. Furthermore, SMS's instruction ordering considers both the RecMII of the recurrence an instruction belongs to and the criticality of its path in the Data Dependence Graph. This ordering aims to reduce the stage count and achieve a schedule of length MII, minimizing execution time. By avoiding backtracking and using this more sophisticated ordering, SMS achieves near-optimal schedules while keeping register pressure low, making it a compelling approach that outperforms the other methods in compile speed and register utilization.