Scalable Load Balancing Techniques for Parallel Computers. Vipin Kumar and Ananth Y. Grama, Department of Computer Science.

Scalable Load Balancing for Parallel Computers

Document information

Author

Vipin Kumar, Ananth Y. Grama

School

University of Minnesota, University of Central Florida

Major Computer Science
Place Minneapolis, MN; Orlando, Florida
Document type Research Paper
Language English
Format PDF
Size 400.90 KB

Summary

I. Challenges in Dynamic Load Balancing for Parallel Algorithms

This research addresses the challenges of achieving optimal dynamic load balancing in parallel computing, particularly for problems where the workload is unpredictable. The inherent difficulty lies in the inability to accurately estimate the size of the total work at any given processor, making scalability analysis crucial. The study focuses on improving the efficiency of various parallel algorithms across different architectures, including hypercube, mesh, and networks of workstations, by implementing and analyzing several load balancing schemes.

1. The Fundamental Problem: Inherent Difficulty in Workload Estimation

The core challenge addressed in this research is the inherent difficulty of accurately estimating the workload at each processor, an uncertainty that is particularly prevalent in problems with dynamic and irregular work distributions. The paper highlights that it is often impossible, or at least very difficult, to predict the size of the individual tasks assigned to each processor. This unpredictability necessitates sophisticated dynamic load balancing schemes, as static approaches inevitably lead to inefficient resource utilization. The researchers aim to determine the most scalable load balancing techniques across different parallel architectures, acknowledging that the difficulty of predicting workload size is a major obstacle to optimizing performance and restricts the applicability of many load balancing strategies. The research therefore focuses on methods suitable for scenarios with this crucial uncertainty in workload distribution.

2. The Need for Dynamic Load Balancing and Scalability Analysis

The inability to precisely estimate the workload size necessitates dynamic load balancing techniques, which distribute tasks among processors at runtime and adapt to changing workloads. The introduction notes that numerous dynamic load balancing strategies have been developed, but underscores the difficulty of objectively comparing their relative effectiveness. Experimental evaluations on physical parallel machines are hampered by the sensitivity of performance to hardware characteristics such as the interconnection network, CPU speed, and communication channel speed; changes in any of these parameters can significantly alter the observed performance of load balancing schemes, rendering comparisons across different experiments inconclusive. The researchers emphasize the critical importance of scalability analysis in overcoming this limitation. Scalability analysis allows performance observations from small-scale experiments to be extrapolated to larger systems, supporting informed architecture-algorithm selection under various constraints on problem size and processor count. It also allows performance on large systems to be predicted from results with fewer processors, and the effect of hardware improvements to be forecast.

3. Characteristics of Suitable Problems and Algorithmic Approaches

The document specifies that the proposed dynamic load balancing schemes are particularly relevant to a class of problems where the total workload can be easily partitioned into many subtasks, but the size of each subtask is highly variable and unpredictable. This characteristic is present in many algorithms, including tree search algorithms commonly used in artificial intelligence and operations research, and divide-and-conquer algorithms. The example of depth-first search in a state-space tree illustrates this perfectly: while the tree can be divided, the subtrees may differ significantly in size, making accurate work estimation impossible. This inherent uncertainty in work size is the key criterion that differentiates the problems best suited for the proposed load balancing techniques. The inability to estimate the workload size prior to execution fundamentally restricts the choice of load balancing strategies and necessitates dynamic solutions. Therefore, the research specifically addresses the challenges and opportunities presented by this particular class of parallel problems.

4. Architectures and the Goal of Near-Optimal Load Balancing

The central objective of this research is to identify the most scalable load balancing schemes for several distinct parallel architectures: hypercube, mesh, and networks of workstations. For each architecture, the researchers establish lower bounds on the scalability of any conceivable load balancing scheme, setting a benchmark against which the optimality of different approaches can be judged. A significant contribution of this work is the scalability analysis of several load balancing schemes previously unexamined in the literature, offering insight into how different schemes perform under varying problem and architectural parameters. The researchers aim to determine near-optimal load balancing strategies for each architecture considered, striving for solutions that distribute work effectively while minimizing overhead and maximizing resource utilization. The theoretical findings for the hypercube architecture are validated through implementation and testing on the Ncube/2 multicomputer, using the Tautology Verification problem, demonstrating the accuracy and utility of the analytical framework.

II. Isoefficiency: A Metric for Scalability

The research utilizes the isoefficiency metric to quantitatively evaluate the scalability of different load balancing algorithms. Isoefficiency analysis determines how the problem size must scale with the number of processors (P) to maintain a desired level of efficiency (E). The lower bound on any isoefficiency function is linear in P; an algorithm that achieves this bound has optimal scalability. Algorithms with isoefficiencies of O(P log^c P), for small constant c, are considered reasonably optimal for practical purposes. The study establishes lower bounds on isoefficiency for each target architecture.

1. Isoefficiency as a Measure of Scalability

The core of this section is the introduction of isoefficiency as a critical metric for evaluating the scalability of parallel algorithms. The authors explain that when a fixed-size problem is solved using an increasing number of processors (P), efficiency decreases due to rising overhead. However, for many algorithms, increasing the problem size (W) while keeping P constant leads to higher efficiency, as overhead growth is slower than W. Scalable algorithms maintain efficiency with increasing P provided W also increases. The isoefficiency function, f(P), describes the relationship between P and W required to maintain efficiency E; its plot forms the isoefficiency curve. A key finding is that the lower bound of an isoefficiency function is asymptotically linear, reflecting the inherent sequential component in all problems. Linear isoefficiency denotes optimal scalability, while O(P log^c P) for a small constant c is considered reasonably optimal in practice. The concept of scalable parallel algorithms is introduced, where maintaining a desired efficiency with increasing processors requires a corresponding increase in problem size. This crucial concept is foundational to the analysis performed throughout the paper.
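
To make the metric concrete, here is the standard isoefficiency formulation as a LaTeX sketch (the symbols W for problem size, T_o for total parallel overhead, and E for efficiency are the usual ones and are assumed here, not drawn from the summary above):

% Standard isoefficiency relation (symbols assumed: W = problem size,
% T_o(W,P) = total overhead summed over all P processors, E = efficiency).
\[
  E \;=\; \frac{W}{W + T_o(W,P)}
  \qquad\Longrightarrow\qquad
  W \;=\; \frac{E}{1-E}\, T_o(W,P).
\]
% Solving this relation for W as a function of P, with E held fixed,
% yields the isoefficiency function f(P): the rate at which the problem
% size must grow with P to keep efficiency constant.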

2. Isoefficiency Lower Bounds for Different Architectures

This subsection focuses on deriving lower bounds for isoefficiency functions on different parallel architectures. The authors reason that for a hypercube architecture, the minimum time for work propagation is log P (since the furthest processors are log P hops away), resulting in a lower bound of P log P for the isoefficiency function. Similarly, for a mesh architecture, work propagation requires √P time, leading to an isoefficiency lower bound of P^1.5. In networks of workstations, the sequential nature of message transmission across the network implies that at least P messages are needed, resulting in an execution time lower bound of P and an isoefficiency lower bound of P². These lower bounds serve as benchmarks against which the scalability of different load balancing schemes can be compared. The determination of these lower bounds demonstrates a thorough understanding of the architectural limitations that affect load balancing and performance. The theoretical groundwork laid by establishing these lower bounds forms the basis for evaluating the efficiency and scalability of the algorithms proposed later in the research.
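
The lower-bound argument can be restated compactly as a LaTeX sketch (D(P) denotes the minimum work-propagation time and is introduced here only for convenience):

% Sketch of the lower-bound argument; D(P) = minimum time for work to
% reach the farthest processor on the given architecture.
\[
  T_P \;\ge\; D(P)
  \quad\Longrightarrow\quad
  W \;=\; \Omega\!\bigl(P \cdot D(P)\bigr) \;\;\text{for constant efficiency},
\]
\[
  \text{hypercube: } D(P)=\log P \Rightarrow W=\Omega(P\log P),\qquad
  \text{mesh: } D(P)=\sqrt{P} \Rightarrow W=\Omega(P^{1.5}),
\]
\[
  \text{network of workstations: } D(P)=P \Rightarrow W=\Omega(P^{2}).
\]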

3. Interpreting Isoefficiency Results: Scalability and Speedup

This part delves into the interpretation of isoefficiency results, relating them to the practical implications of scalability and speedup. The authors explain that if the problem size (W) needs to grow exponentially with the number of processors (P) to maintain efficiency, the algorithm-architecture combination is poorly scalable, resulting in limited speedup unless the problem is enormously large. In contrast, a linear growth of W with respect to P indicates high scalability, promising significant speedup with reasonable problem size increases. The isoefficiency function, therefore, acts as a predictor of scalability. The research underscores the utility of isoefficiency analysis in guiding decisions on algorithm and architecture choices. By understanding the relationship between problem size and processor count required for achieving a specific efficiency, researchers can make informed choices for balancing computational resources with problem complexity and architecture limitations, ultimately leading to better optimization of parallel computing systems.

III. Receiver-Initiated Load Balancing Schemes

Several receiver-initiated load balancing schemes are analyzed, including Asynchronous Round Robin (ARR), Global Round Robin (GRR), Random Polling (RP), Nearest Neighbor (NN), and a modified Global Round Robin with message combining (GRR-M). The performance of these schemes is assessed based on communication overheads and contention for shared resources. The analysis considers the impact of variable work transfer costs on the scalability of these schemes and the effect of hardware characteristics such as message startup time and per-hop time on communication overheads.

1. Asynchronous Round Robin (ARR)

The Asynchronous Round Robin (ARR) scheme is described as a simple and efficient load balancing strategy. Each processor independently maintains a target variable. When a processor runs out of work, it sends a request to the processor indicated by the target variable, then increments the target (modulo P). The simplicity stems from the independence of processors; each generates work requests without coordination, avoiding potential bottlenecks. The scheme's simplicity and lack of coordination make it attractive, but its scalability will depend on factors such as network congestion and the distribution of work among processors. Further analysis would be necessary to determine its isoefficiency characteristics and performance trade-offs against more complex methods.
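
A minimal sketch of the ARR target-selection logic in Python (the class name is hypothetical, the starting value of the target pointer is an assumed convention, and the paper's actual implementation is message-passing code):

class ARRTargetSelector:
    """Asynchronous Round Robin (sketch): each processor keeps its own
    independent target pointer and cycles through the other processors."""

    def __init__(self, my_id: int, num_procs: int):
        self.my_id = my_id
        self.num_procs = num_procs
        self.target = (my_id + 1) % num_procs   # assumed starting convention

    def next_target(self) -> int:
        """Return the processor to poll for work, then advance the pointer,
        skipping this processor's own id."""
        t = self.target
        self.target = (self.target + 1) % self.num_procs
        if t == self.my_id:
            t = self.target
            self.target = (self.target + 1) % self.num_procs
        return t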

2. Global Round Robin (GRR)

The Global Round Robin (GRR) scheme employs a global variable, TARGET, stored in processor 0. When a processor requires work, it obtains TARGET's value from processor 0, which then increments it (modulo P). The work request is sent to the processor whose ID matches the retrieved value from TARGET. This approach ensures even distribution of work requests, but introduces potential contention for accessing the shared TARGET variable. This contention could become a significant bottleneck as the number of processors increases, limiting scalability. The centralized nature of this approach makes it susceptible to bottlenecks, unlike the decentralized nature of ARR. While aiming for even distribution, the overhead associated with managing and accessing the global variable must be considered to fully understand the scalability and isoefficiency function of GRR.
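
For contrast, a sketch of the GRR counter kept on processor 0 (again a hypothetical class name; in the real scheme each fetch-and-increment is a message round-trip to processor 0 rather than a local call):

class GRRTargetServer:
    """Global Round Robin (sketch): a single TARGET counter held by
    processor 0 and read-and-incremented on every work request."""

    def __init__(self, num_procs: int):
        self.num_procs = num_procs
        self.target = 0

    def fetch_and_increment(self) -> int:
        """Serve one work request: hand out the current TARGET value and
        advance it modulo P.  Because every request is serialized through
        this one counter, contention grows with the number of processors."""
        t = self.target
        self.target = (self.target + 1) % self.num_procs
        return t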

3. Random Polling (RP), Nearest Neighbor (NN), and Locality Considerations

The Random Polling (RP) scheme sends each work request to a processor chosen uniformly at random, so requests are spread evenly across the machine without any shared state. The Nearest Neighbor (NN) scheme, by contrast, emphasizes locality of communication: when a processor exhausts its work, it polls only its immediate neighbors in a round-robin fashion, which on a hypercube means contacting just its log P neighbors. For architectures in which all processor pairs are equidistant (such as a fully connected network), NN is equivalent to ARR. The advantage of NN lies in its localized communication, since work requests and data transfers remain within close proximity, reducing communication latency. However, this same locality can slow the distribution of work from heavily loaded regions to distant idle processors, so the trade-off between communication locality and overall workload distribution efficiency must be examined; on architectures like the hypercube this localized behavior can become a performance bottleneck.
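
The two target-selection rules can be sketched side by side (hypothetical helper names; the neighbor computation assumes hypercube processor ids are bit strings, as is standard for hypercubes):

import random

def rp_target(my_id: int, num_procs: int) -> int:
    """Random Polling (sketch): pick a donor uniformly at random
    among the other processors."""
    t = random.randrange(num_procs - 1)
    return t if t < my_id else t + 1          # skip this processor's own id

class NNTargetSelector:
    """Nearest Neighbor (sketch): poll only immediate neighbors, round-robin.
    On a d-dimensional hypercube the neighbors are the ids that differ from
    my_id in exactly one bit."""

    def __init__(self, my_id: int, dimension: int):
        self.neighbors = [my_id ^ (1 << b) for b in range(dimension)]
        self.pos = 0

    def next_target(self) -> int:
        t = self.neighbors[self.pos]
        self.pos = (self.pos + 1) % len(self.neighbors)
        return t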

4. Global Round Robin with Message Combining (GRR-M)

GRR-M is presented as a more sophisticated approach that incorporates message combining for improved efficiency, especially suitable for hypercube architectures. The scheme embeds a spanning tree on the hypercube rooted at processor 0. Requests to read and increment TARGET travel up the tree, with intermediate nodes potentially combining requests before forwarding them to the root. The combined increment is applied to TARGET, and the result is sent back down the tree. While this reduces the number of messages, software implementation on systems lacking dedicated message combining hardware is expensive, adding overhead. The effectiveness of GRR-M relies on the balance between message combining efficiency and the added overhead associated with the spanning tree and request combination mechanisms. Its isoefficiency compared to other schemes needs careful evaluation, particularly concerning the cost of software message combining.
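
The net effect of message combining can be illustrated with a small sketch that models only what happens at the root of the spanning tree (the tree routing itself is omitted and the function name is hypothetical):

def grr_m_combined_round(requesters, target, num_procs):
    """GRR-M combining (sketch): 'requesters' are the processor ids whose
    requests reached processor 0 in one combining round, already merged by
    intermediate tree nodes.  The root applies a single increment of
    len(requesters) to TARGET and sends each requester its own value back
    down the tree."""
    k = len(requesters)
    values = {proc: (target + i) % num_procs for i, proc in enumerate(requesters)}
    new_target = (target + k) % num_procs
    return values, new_target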

5. Analysis of Communication and Contention Overheads

The analysis of receiver-initiated schemes considers both communication overheads and contention for shared resources. Communication overhead is influenced by factors such as message size, network topology, and the number of messages exchanged; contention overhead arises when multiple processors access a shared resource simultaneously, such as the global TARGET variable in GRR. The analysis recognizes that communication time can vary with network traffic, and a crucial assumption is made and justified: that communication time can be treated as constant under certain constraints. The analysis then identifies the dominant overhead for each scheme, forming the basis for calculating isoefficiency functions and providing a rigorous quantification of each scheme's scalability. The impact of the constant-cost assumption is explored further in a later section that examines varying message costs.
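
The overhead accounting behind these isoefficiency calculations takes the following general shape (a hedged LaTeX sketch; the per-scheme request counts are what the paper's analysis actually derives, and they are not reproduced here):

% General shape of the overhead bound (sketch).
% R(P, W) = total number of work requests/transfers generated by a scheme,
% U_comm  = cost of one request-plus-transfer under the constant
%           communication-cost assumption discussed above.
\[
  T_o \;\approx\; R(P,W)\cdot U_{\mathrm{comm}},
  \qquad
  W \;=\; \Theta(T_o) \;\;\text{for fixed efficiency},
\]
% so a scheme's isoefficiency function follows directly from how its
% request count R grows with P (and, more weakly, with W).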

IV. Sender-Initiated Load Balancing Schemes

The research also investigates sender-initiated load balancing, including Single Level (SL) and Multilevel (ML) schemes. These schemes are compared against receiver-initiated approaches, considering their performance characteristics and the suitability of various work transfer mechanisms. The analysis highlights the trade-offs between communication efficiency and the need for parameter tuning in these approaches.

1. Single-Level Load Balancing (SL)

The Single-Level Load Balancing (SL) scheme divides the task into numerous subtasks, aiming for roughly equal work distribution across processors. A designated MANAGER processor generates these subtasks and distributes them on demand. While this approach statistically ensures a degree of load balance, subtask generation at the MANAGER becomes a serious bottleneck as the number of processors (P) grows: beyond a certain point the MANAGER cannot generate subtasks fast enough to keep the other processors busy, and performance degrades rapidly. The scheme's scalability is therefore poor, even though it may outperform multilevel schemes on small numbers of processors, and this bottleneck limits its applicability in large-scale parallel systems.
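
A minimal sketch of the single-level manager loop (the class name is hypothetical; the point is that every request is served by one sequential queue on the MANAGER):

from collections import deque

class Manager:
    """Single-level scheme (sketch): processor 0 pre-generates subtasks and
    hands one out per work request.  The sequential hand-out loop is the
    bottleneck that limits scalability as P grows."""

    def __init__(self, subtasks):
        self.queue = deque(subtasks)

    def handle_request(self):
        """Serve one idle worker; return None when no work remains."""
        return self.queue.popleft() if self.queue else None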

2. Randomized Allocation and its Limitations

The Randomized Allocation Strategy, proposed by Shu and Kale, assigns newly generated subtasks to randomly chosen processors. This random allocation inherently introduces a degree of load balance. However, the document points out several practical implementation difficulties. Because each node expansion requires a communication step, efficiency is bound by the ratio of node expansion time to the combined time for expansion and communication. The method is only suitable for problems where computation per node significantly outweighs communication costs. The scheme's requirement for linearly increasing communication bandwidth with respect to P renders it impractical for large-scale systems across various architectures (hypercube, mesh, networks of workstations). Furthermore, the unbounded memory requirement, due to the potential for a processor to store many work pieces, poses a substantial challenge. This contrasts with other schemes where memory requirements remain similar to sequential implementations.
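
The efficiency bound described above can be written out explicitly (LaTeX sketch; t_exp and t_comm are introduced here for the node expansion time and the per-node communication time):

% One communication step per node expansion bounds achievable efficiency:
% t_exp = time to expand one node, t_comm = time to ship it to a random processor.
\[
  E \;\le\; \frac{t_{\mathrm{exp}}}{t_{\mathrm{exp}} + t_{\mathrm{comm}}}.
\]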

3. Multilevel Load Balancing (ML) and Granularity Control

The Multilevel Load Balancing (ML) scheme is presented as an alternative to address some of the limitations of the single-level approach. Similar to the randomized allocation strategy, it faces challenges related to the communication overhead associated with node expansion. This communication cost can be mitigated by implementing granularity control, where work is not distributed below a specific cutoff depth. While reducing communication overhead, this technique introduces its own complications as subtrees below the cutoff may be of widely varying sizes. Finding an optimal cutoff depth is crucial; a depth that's too shallow might leave substantial work for sequential processing, while a depth that's too deep might not reduce communication significantly. The challenges in effective granularity control for highly irregular state-space trees are discussed, indicating that the optimal performance depends heavily on careful tuning of parameters. The effectiveness of ML is therefore highly dependent on the problem characteristics and the effective selection of parameters.
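
A sketch of cutoff-depth granularity control in Python (the cutoff value and the distribute/expand_locally hooks are hypothetical placeholders, not the paper's code):

CUTOFF_DEPTH = 12    # hypothetical value; choosing this well is the hard part

def expand(node, depth, distribute, expand_locally):
    """Granularity control (sketch): children generated above the cutoff depth
    are eligible for distribution to other processors; below it they are kept
    and searched sequentially, trading load balance for less communication."""
    for child in node.children():
        if depth < CUTOFF_DEPTH:
            distribute(child)                 # may be shipped to another processor
        else:
            expand_locally(child, depth + 1)  # finished off on this processor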

4. Comparison of Sender-Initiated and Receiver-Initiated Schemes

The document compares sender-initiated schemes (such as SL and ML) with the previously discussed receiver-initiated methods. A key distinction lies in their work transfer mechanisms: sender-initiated approaches transfer the current state as the unit of work, whereas receiver-initiated schemes typically split and transfer part of a donor's stack. The sender-based mechanism is more efficient when the state description is small, since work stacks can become deep and stack splitting is computationally costly. On machines with low message startup times, the cost of communicating a stack becomes sensitive to its size, further favoring sender-initiated methods. The choice between sender-initiated and receiver-initiated load balancing therefore depends on problem characteristics and architecture-specific features. While sender-initiated schemes can be efficient under specific conditions, they often require considerable fine-tuning and parameter adjustment, adding complexity and implementation difficulty.
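
For comparison with the sender-based state transfer, here is one common stack-splitting strategy sketched in Python (donating alternate untried alternatives; this is an illustrative splitting rule, not necessarily the exact splitting function used in the paper):

def split_stack(stack):
    """Stack splitting (sketch): 'stack' is a list of levels, each level holding
    the untried alternatives at that depth of the DFS.  Give away every other
    untried node and keep the rest, so both halves receive work from the
    shallow (and typically larger) parts of the tree."""
    donated = []
    give = True
    for level in stack:
        kept = []
        for node in level:
            (donated if give else kept).append(node)
            give = not give
        level[:] = kept          # donor keeps its half in place
    return donated               # shipped to the requesting processor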

V. Experimental Results and Discussion: The Satisfiability Problem on the Ncube/2

Experimental results are presented for the Satisfiability problem solved with the Davis-Putnam algorithm on an Ncube/2 multicomputer, validating the theoretical scalability analysis of the various load balancing schemes. The impact of factors such as the number of processors, problem size, and hardware characteristics (including the cut-through message routing of the Ncube/2) on the speedup and efficiency of the different schemes is discussed. The results demonstrate the accuracy and viability of the isoefficiency framework for predicting performance across varying hardware and problem parameters. The best-performing schemes are highlighted in light of both theoretical and experimental results, with GRR-M, RP, and NN compared on the Ncube/2.

1. Experimental Setup: The Satisfiability Problem and the Ncube/2 Multicomputer

The experimental evaluation involved testing eight different load balancing schemes on a second-generation Ncube multicomputer. The chosen problem was Boolean Satisfiability: determining whether a Boolean formula has a satisfying assignment. The problem is solved using the Davis-Putnam algorithm, which explores a binary tree of true/false assignments to literals, and is relevant to domains including VLSI design and theorem proving. The Ncube/2, a hypercube multicomputer with cut-through message routing, was used for the experiments; the time to communicate a message over d hops is modeled as k + d·t_h, where k is a constant startup term and t_h is the per-hop time. This setup allowed a rigorous assessment of the load balancing schemes in a practical, high-performance computing environment. Problem instances varied widely in size (100,000 to 10 million nodes) and in depth (35 to 65 variables), ensuring a comprehensive test of scalability.
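
As a reference point for the kind of search being parallelized, here is a minimal Davis-Putnam-style satisfiability check in Python (a textbook sketch, not the paper's solver; real implementations add unit propagation and other pruning rules):

def satisfiable(clauses, assignment=None):
    """Minimal DPLL-style search (sketch): 'clauses' is a list of clauses, each
    a list of signed integers (e.g. [1, -2] means x1 OR NOT x2).  Explores the
    binary tree of true/false assignments depth-first."""
    assignment = assignment or {}
    remaining = []
    for clause in clauses:
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue                                  # clause already satisfied
        rest = [lit for lit in clause if abs(lit) not in assignment]
        if not rest:
            return False                              # clause falsified: backtrack
        remaining.append(rest)
    if not remaining:
        return True                                   # every clause satisfied
    var = abs(remaining[0][0])                        # branch on an unassigned variable
    return (satisfiable(remaining, {**assignment, var: True}) or
            satisfiable(remaining, {**assignment, var: False}))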

2. Results and Comparison of Load Balancing Schemes

The experimental results, presented in tables and graphs, show the speedups achieved by the eight load balancing schemes for the Satisfiability problem. The results indicate that the performance of GRR and SB schemes was very similar, aligning with their identical isoefficiency functions. The isoefficiency of O(P² log P) for GRR and SB suggests that performance degrades rapidly beyond 256 processors. Good speedups for P > 256 are only observed for very large problem instances. ARR, while more scalable than GRR and SB, was still significantly less scalable than RP, NN, or GRR-M. The differences between the schemes' performance highlight the trade-offs between simplicity, communication overhead, and contention for shared resources. The experimental data confirms the theoretical analysis presented earlier in the paper, demonstrating that isoefficiency analysis accurately predicts performance across various problem sizes and processor counts.

3. Analysis of Specific Schemes: GRR-M, RP, and NN

The discussion highlights the comparative performance of GRR-M, RP, and NN schemes. While GRR-M theoretically showed the best isoefficiency for the hypercube architecture, experimental results showed that speedups were comparable to RP. This discrepancy is attributed to the high cost of software message combining in the absence of dedicated hardware on the Ncube/2. The analysis suggests that the difference in performance would become more pronounced with a larger number of processors. Between RP and NN, RP demonstrates superior asymptotic scalability, while NN exhibited slightly better performance in the experiments due to the high quality of the splitting function and the relatively small number of processors used. The experimental results confirm that for scenarios with poorer splitting functions, RP would generally outperform NN, particularly as the number of processors increases. This highlights the importance of considering both theoretical isoefficiency and the practical impacts of work splitting function quality.

4. Impact of Technology and Architecture: Extrapolating from the Results

The study uses the experimental results to illustrate the predictive power of isoefficiency analysis, showing how the effects of changes in hardware characteristics (CPU speed, communication channel speed) can be predicted. For instance, a tenfold increase in CPU speed would require a tenfold increase in problem size to maintain the same efficiency, while a tenfold increase in communication speed would allow the same efficiency to be achieved on a problem instance one-tenth the original size. The impact of such changes is moderate for the schemes studied here but could be dramatic for other algorithms. The analysis also extends to comparing scalability across architectures: for example, the isoefficiency of the best hypercube technique, O(P log²P), is shown to be superior to that achievable on a network of workstations for the examined problem. This ability to predict performance changes under varying circumstances is a major advantage of isoefficiency analysis for assessing and optimizing parallel algorithms and architectures. The experiments validated the theoretical findings and showed that the results can be extrapolated to larger systems and different hardware configurations.

VI. Impact of Technology and Architecture on Scalability

The study concludes by examining the impact of technological advancements and architectural choices on the scalability of parallel algorithms. Isoefficiency analysis is shown to be a powerful tool for predicting the effect of changes in CPU speed, communication channel speeds, and architecture on the performance of load balancing schemes. The comparative scalability of the hypercube architecture against networks of workstations is explored, demonstrating the versatility of isoefficiency analysis in architecture selection.

1. Predicting Performance Changes with Isoefficiency Analysis

This section highlights the use of isoefficiency analysis to predict how changes in hardware and technology impact the performance of load balancing schemes. The authors demonstrate that isoefficiency analysis allows for accurate predictions of performance changes due to alterations in CPU speed or communication channel speed. Specifically, a tenfold increase in CPU speed would necessitate a tenfold increase in problem size to maintain the same efficiency. Conversely, a tenfold improvement in communication speed would enable achieving the same efficiency with a problem size one-tenth the original. This predictability is a significant advantage of the isoefficiency framework, particularly when compared to other algorithms (like FFT and matrix algorithms) that might exhibit more drastic performance changes with similar technological shifts. The analysis demonstrates the robustness of isoefficiency in predicting changes in algorithmic speed and efficiency in the face of evolving hardware.
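
The arithmetic behind these two predictions is straightforward once efficiency is written with an explicit per-unit computation time (LaTeX sketch; t_c is introduced here for the CPU time per unit of work, and T_o is taken to be communication-dominated):

% Efficiency with explicit hardware parameters (sketch):
% t_c = CPU time per unit of work, T_o(W,P) = communication-dominated overhead.
\[
  E \;=\; \frac{t_c\,W}{t_c\,W + T_o(W,P)}.
\]
% A 10x faster CPU divides t_c by 10, so W must grow roughly 10x for t_c W
% (and hence E) to stay the same; a 10x faster network divides T_o by 10,
% so the same E is reached with a problem roughly one-tenth the size.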

2. Comparing Architectural Scalability: Hypercube vs. Network of Workstations

The research uses isoefficiency analysis to compare the scalability of different parallel architectures, with an example contrasting a hypercube against a network of workstations. The best-performing load balancing scheme on the hypercube has an isoefficiency function of O(P log²P). Under this function, maintaining the efficiency obtained with 16 processors at a much larger machine size requires only about a 400-fold increase in problem size on the hypercube, whereas the corresponding scale-up on a network of workstations would require an increase of about 10240-fold, which is far more demanding. The analysis concludes that the hypercube is a much more scalable platform than a network of workstations for problems whose workload cannot be estimated a priori. This comparison showcases the utility of isoefficiency analysis in making architectural choices and guiding the development of scalable parallel computing systems.
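
The two quoted growth factors are consistent with a scale-up from 16 to 1024 processors under isoefficiency functions of P log²P for the hypercube and P² log P for the network-of-workstations scheme; both the 1024-processor endpoint and the P² log P function are inferences used only for this consistency check, not statements from the summary above:

% Consistency check (assumed scale-up from P = 16 to P = 1024):
\[
  \text{hypercube } (W \propto P\log^2 P):\quad
  \frac{1024\cdot(\log_2 1024)^2}{16\cdot(\log_2 16)^2}
  \;=\; \frac{1024\cdot 100}{16\cdot 16} \;=\; 400,
\]
\[
  \text{workstations } (W \propto P^2\log P):\quad
  \frac{1024^2\cdot\log_2 1024}{16^2\cdot\log_2 16}
  \;=\; \frac{1048576\cdot 10}{256\cdot 4} \;=\; 10240.
\]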

3. Broader Implications and Future Research Directions

The final part of this section discusses the broader applicability of the research findings and suggests directions for future work. The authors point out that different load balancing strategies are needed for problems with different characteristics, particularly concerning the communication coupling between subtasks and the ability to estimate work size. This research focuses on a specific point on the spectrum where there is no communication coupling and work size is unpredictable. The research suggests that future investigations into optimal load balancing schemes for problems with communication coupling and predictable workload sizes would be valuable. This suggests a wide spectrum of applications for various load-balancing algorithms, depending on their specific requirements. The discussion emphasizes the need to tailor load balancing strategies to the specific demands and characteristics of each problem domain and concludes by noting the need for further research into a wider range of problem characteristics.