Conclusions and Future Work

Coscheduling Techniques in Cluster Computing

Summary

I. Conclusions on Explicit-Control and Implicit-Control Coscheduling Techniques for Distributed Applications in Cluster Computing

This research presents and evaluates two novel coscheduling techniques for improving the performance of distributed applications that run alongside local workloads in cluster computing environments. One technique takes an explicit-control approach, offering performance guarantees akin to gang scheduling by dedicating resources to distributed and local tasks during separate, well-defined intervals. The other takes an implicit-control approach, aiming to minimize overhead by suspending a blocked process only once the cost of continued active waiting exceeds that of a context switch. Experiments on a real PVM-LINUX cluster showed that the explicit method, particularly its BALANCED model, effectively improves the performance of message-passing-intensive distributed applications, whereas its synchronization overhead is detrimental to CPU-bound applications. The implicit method yields only slight gains, chiefly under low system loads. The explicit approach offers the additional advantage of mode selection, tailoring resource allocation to application needs; however, its centralized control raises fault-tolerance concerns.

1. Explicit-Control Coscheduling: Design, Implementation, and Performance Analysis

An explicit-control coscheduling mechanism, implemented on a real cluster, was designed to guarantee distributed application performance by allocating separate execution intervals to distributed and local applications. The system performed well across various message-passing distributed applications, with coscheduling effects most noticeable during synchronization and message exchange. Two models, STATIC and BALANCED, were developed to accelerate message-passing-intensive distributed applications through synchronized parallel and local execution periods; the BALANCED model further optimizes performance by adjusting these periods to the average local workload. Conversely, for CPU-bound applications the synchronization proved detrimental, adding unnecessary overhead. A key advantage of the explicit scheme is its ability to switch operational modes, which lets it outperform implicit-control methods. Fault tolerance, however, was identified as a major drawback owing to the centralized control mechanism, and distributing the master would significantly increase communication overhead. Finally, active waiting for events (such as blocking receives) in time-sharing systems was found to be inefficient, since it penalizes the performance of local tasks.
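
To make the period assignment concrete, the following minimal sketch computes one scheduling round's parallel and local slices under each model; the round length, the load measure, and all identifiers are illustrative assumptions, not the thesis's actual implementation.

```c
/* Sketch of the explicit-control period assignment, assuming a master that
 * broadcasts one parallel slice (tp) and one local slice (tl) per round.
 * ROUND_MS and the 50/50 STATIC split are assumptions for illustration. */
#include <stdio.h>

#define ROUND_MS 200            /* total length of one scheduling round */

/* STATIC model: fixed split between distributed and local work. */
static void static_periods(int *tp, int *tl) {
    *tp = ROUND_MS / 2;
    *tl = ROUND_MS - *tp;
}

/* BALANCED model: grow or shrink the local slice in proportion to the
 * average local workload observed on the nodes (0.0 = idle, 1.0 = busy). */
static void balanced_periods(double avg_local_load, int *tp, int *tl) {
    *tl = (int)(ROUND_MS * avg_local_load);
    *tp = ROUND_MS - *tl;
}

int main(void) {
    int tp, tl;
    static_periods(&tp, &tl);
    printf("STATIC:   parallel=%d ms, local=%d ms\n", tp, tl);
    balanced_periods(0.25, &tp, &tl);   /* lightly loaded nodes */
    printf("BALANCED: parallel=%d ms, local=%d ms\n", tp, tl);
    return 0;
}
```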

2. Implicit-Control Coscheduling: Spin-Block Technique and Limitations

The study also explored an implicit-control coscheduling technique based on a spin-block mechanism: a receiving process busy-waits (spins) on a pending message and is suspended only once the cost of further active waiting would exceed that of a context switch. While this minimizes the impact of short waiting times, the method showed only slight performance gains, and only under low system workloads. The additional overhead generated by the spin-block approach degraded both distributed and local application performance. The implicit method proved less beneficial than explicit control, especially in time-sharing systems, where spinning forfeits opportunities to execute other tasks. Overall, the gains from spinning during blocking receives did not outweigh the overhead, making the method less efficient than the explicit approach.
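
The spin-block decision amounts to a two-phase wait. The sketch below illustrates the idea, assuming a spin threshold of roughly twice the context-switch cost and using a pthread condition variable in place of the real message-arrival mechanism; the threshold value and identifiers are assumptions for illustration.

```c
/* Two-phase (spin-then-block) receive. Compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SPIN_THRESHOLD_US 50    /* assumed ~2x context-switch cost */

static atomic_int msg_ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static long elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000L;
}

/* Phase 1: spin (cheap if the message arrives quickly).
 * Phase 2: block, yielding the CPU to local tasks. */
static void spin_block_receive(void) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        if (atomic_load(&msg_ready)) return;        /* spin hit */
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (elapsed_us(start, now) > SPIN_THRESHOLD_US) break;
    }
    pthread_mutex_lock(&lock);                      /* spin missed: block */
    while (!atomic_load(&msg_ready))
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

static void *sender(void *arg) {
    (void)arg;
    usleep(200);                /* message arrives after the spin phase */
    pthread_mutex_lock(&lock);
    atomic_store(&msg_ready, 1);
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);
    spin_block_receive();
    puts("message received");
    pthread_join(t, NULL);
    return 0;
}
```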

3. Predictive Coscheduling: System-Level Implementation and Performance Evaluation

A novel predictive coscheduling technique was introduced, offering improved performance compared to the implicit method. In contrast to user-space implementations (like the explicit and implicit models), this method resided in system space (Linux kernel), providing transparency and independence from specific Distributed Computing Environments (DCEs) like PVM or MPI. This system-level implementation reduced the overhead associated with user-level mechanisms, enhancing the overall efficiency. The study contrasts this with a dynamic coscheduling variation, based only on receiving frequency, which performed less effectively. Extensive testing with NAS and PARKBENCH benchmarks, including synthetic applications designed to control communication parameters like message length, showcased its broad applicability to fine-grained distributed applications. The results demonstrated that the predictive model's performance advantages considerably outweighed the introduced local overhead. The additional overhead was found to be proportional to the CPU demands of the distributed applications, underscoring the effectiveness of promoting applications based on communication frequency.

4. Comparative Analysis: Simulation vs. Real-World Implementation and CMC Metrics

A key finding highlighted the limitations of simulation in accurately predicting performance. While simulation failed to reveal significant performance discrepancies between predictive and dynamic coscheduling models, real-world testing showed a clear advantage for the predictive model. This discrepancy underscores the need for real-world experiments to account for system complexities that simulation often overlooks, such as variations in computing power, hardware resources, operating system latency, network latency, bandwidth, and bottlenecks. The research also introduced CMC metrics, which provide real-time performance data without needing to wait for task completion, thus facilitating on-time tuning of distributed system parameters. This is in contrast to traditional metrics like speedup and efficiency that necessitate separate serial and parallel executions. The challenges of using CMC metrics for absolute performance measurements and the benefit of dynamic model switching for large-scale applications are discussed. Furthermore, issues relating to memory management in non-dedicated systems were also addressed, noting that traditional benefits of paging in uniprocessors might be diminished in distributed environments due to interactions between scheduling disciplines, application synchronization, and page reference patterns.

II. Predictive and Dynamic Coscheduling: Enhancing Performance in Cluster Computing

A significant contribution is the development of a predictive coscheduling technique, implemented within the Linux kernel for enhanced transparency. This approach aims for on-time parallel performance by dynamically adjusting coscheduling parameters based on runtime information, achieving better performance than a simpler dynamic model relying solely on receiving frequency. Evaluation using NAS and PARKBENCH benchmarks showcases its effectiveness across a range of fine-grained distributed applications. Both predictive and dynamic models introduce some overhead to local tasks, but the predictive method’s gains for distributed applications significantly outweigh these penalties. The research highlights a critical difference between simulation and real-world implementation results, emphasizing the limitations of simulation in capturing the complexities of cluster computing environments.

1. Predictive Coscheduling: Design, Implementation, and Advantages

A new predictive coscheduling technique is detailed, representing a key contribution of this research. Unlike previous user-level implementations (such as explicit and implicit methods), this approach was implemented within the Linux kernel's system space. This system-level implementation offers several advantages. Firstly, it provides transparency to the end-user. Secondly, the model's performance is independent of the Distributed Computing Environment (DCE) used, such as PVM or MPI, offering greater flexibility and portability. The predictive model uses runtime information to dynamically adjust coscheduling parameters while distributed applications are executing, enabling on-time parallel performance. This contrasts with a simpler dynamic coscheduling approach, which relies only on receiving frequency and proved less effective in identifying relevant processes. The predictive model's effectiveness was rigorously tested using various kernel benchmarks from the NAS suite (with varying messaging characteristics) and three PARKBENCH low-level benchmarks (measuring performance across barriers and communication in one and two directions). The results demonstrated that the predictive model is applicable to a broad spectrum of fine-grained distributed applications.
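
The essential difference between the predictive and dynamic policies can be illustrated as a priority computation over per-process communication counters; the field names, boost rule, and figures below are illustrative assumptions, not the actual kernel modification.

```c
/* User-space sketch of the two policies' core decision, assuming a
 * per-process record of recent sends and receives. */
#include <stdio.h>

struct task {
    const char *name;
    int recv_freq;     /* messages received in the last interval */
    int send_freq;     /* messages sent in the last interval     */
    int base_prio;     /* smaller value = higher priority (Linux-style) */
};

/* Dynamic policy: boost tasks by receiving frequency alone. */
static int dynamic_prio(const struct task *t) {
    return t->base_prio - t->recv_freq;
}

/* Predictive policy: use both past sends and receives to predict which
 * tasks are about to communicate, and boost those. */
static int predictive_prio(const struct task *t) {
    return t->base_prio - (t->recv_freq + t->send_freq);
}

int main(void) {
    struct task tasks[] = {
        { "pvm-worker", 8, 9, 20 },   /* communication-intensive */
        { "local-job",  0, 0, 20 },   /* purely local workload   */
    };
    for (int i = 0; i < 2; i++)
        printf("%-10s dynamic=%d predictive=%d\n", tasks[i].name,
               dynamic_prio(&tasks[i]), predictive_prio(&tasks[i]));
    return 0;
}
```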

2. Dynamic Coscheduling: Limitations and Comparison with the Predictive Approach

A dynamic coscheduling method, relying solely on message receiving frequency, is presented and compared against the predictive model. This method, unlike the predictive one, does not utilize comprehensive information to identify relevant processes for scheduling. Consequently, its performance was demonstrably inferior to the predictive approach. While both the predictive and dynamic models introduced similar penalties on local tasks in terms of performance overhead, the predictive model's slight increase in local overhead was insignificant compared to the gains achieved for distributed applications. Experimental results confirmed that this overhead is directly proportional to the CPU requirements of distributed applications, and local application performance was only minimally affected by the communication overhead. This supports the claim that promoting distributed applications based on communication frequency, as done by the predictive model, constitutes an efficient coscheduling policy. The limited capabilities of the dynamic model emphasize the superior performance and design of the predictive approach.

3. Comparative Analysis: Simulation vs. Real-World Results and CMC Metrics

The research highlights a significant discrepancy between simulation and real-world implementation results for both predictive and dynamic coscheduling. While simulation did not reveal substantial performance differences between the two, real-world experiments clearly demonstrated the superiority of the predictive model. This underlines the limitations of simulations in accurately modeling real-world complexities and the importance of empirical evaluation. The simulation's aim was to isolate the performance of individual coscheduling policies, avoiding extraneous influences such as computing power, hardware specifications of cluster nodes, operating system latency, network latency, bandwidth, and bottlenecks. Furthermore, the study introduces CMC (Coscheduling Monitoring and Control) metrics, offering a new approach to performance evaluation. Unlike traditional metrics (such as speedup and efficiency requiring separate serial and parallel executions), CMC metrics provide real-time performance data, enabling dynamic tuning of distributed system parameters during application execution. While CMC metrics do not provide absolute performance, comparing them against other models like HPDT provides valuable insights into relative performance, particularly for large distributed applications.

III. CMC Metrics and Future Research Directions in Coscheduling

The study introduces CMC (Coscheduling Monitoring and Control) metrics, enabling real-time performance monitoring and parameter tuning of coscheduling policies. While traditional metrics like speedup and efficiency require separate serial and parallel executions, CMC metrics allow for on-the-fly analysis. Future work will focus on extending implicit-control coscheduling to incorporate memory considerations, aiming to address the challenges of dynamic memory allocation in cluster and NOW environments. This builds on existing research that minimizes the impact of job memory requirements on scheduling policies [80, 14, 78, 79, 81], acknowledging the need for memory-aware implicit-control coscheduling techniques.

1. CMC Metrics: Real-Time Performance Monitoring and Tuning

The research introduces CMC (Coscheduling Monitoring and Control) metrics as a novel approach to performance evaluation in coscheduling. Unlike traditional metrics such as speedup (S = T1/Tn, where T1 is the serial execution time and Tn the parallel execution time on n processors) and efficiency (E = S/n), CMC metrics provide real-time performance data. This allows distributed system parameters to be monitored and tuned on the fly while applications are running, eliminating the need for separate serial and parallel runs to obtain performance figures. A limitation is that CMC metrics do not yield absolute performance values; they are nonetheless useful for comparing different coscheduling models against each other, particularly for sufficiently large distributed applications. The ability to switch between models dynamically through system calls that adjust the system's coscheduling variables would further ease such comparisons.
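
For contrast, the following worked example (with invented timings) shows why the traditional metrics require completed serial and parallel runs before any value can be reported.

```c
/* Worked example of the traditional metrics the text contrasts with CMC:
 * speedup S = T1/Tn and efficiency E = S/n. The timings are made up. */
#include <stdio.h>

int main(void) {
    double t1 = 480.0;   /* serial execution time in seconds (assumed)   */
    double tn = 75.0;    /* parallel execution time on n nodes (assumed) */
    int n = 8;

    double s = t1 / tn;  /* speedup:    480/75 = 6.4 */
    double e = s / n;    /* efficiency: 6.4/8  = 0.8 */
    printf("S = %.2f, E = %.2f\n", s, e);

    /* Both runs must finish before S and E can be computed; CMC metrics
     * instead sample performance while the application is still running. */
    return 0;
}
```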

2. Memory Management in Non-Dedicated Systems and Future Research

The study acknowledges challenges associated with memory management in non-dedicated cluster systems. The dynamic behavior of local applications, affecting allocated resident memory, and distributed job mapping policies without memory considerations, may result in insufficient memory for parallel jobs during execution. This necessitates the coexistence of the local scheduler and the demand-paged virtual memory mechanism. While paging improves memory and CPU utilization in uniprocessors by allowing processes to run with only a subset of their code and data in main memory, its effectiveness is diminished in distributed environments due to the interactions between CPU scheduling disciplines, application synchronization patterns, and page reference patterns. A coscheduling environment proposed in [81] addresses local task starvation by reducing page faults in non-dedicated cluster systems, prioritizing distributed tasks with lower page-faulting probabilities. This research underscores the need to minimize the impact of memory constraints on coscheduling policies. Future work will focus on developing implicit-control coscheduling techniques that explicitly consider dynamic memory resource allocation based on the execution of local and distributed jobs.
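
The selection rule attributed to [81] can be sketched as choosing, among runnable distributed tasks, the one with the lowest estimated page-fault probability; the probability estimates and identifiers below are illustrative assumptions, not the cited system's code.

```c
/* Sketch: dispatch the distributed task least likely to page-fault. */
#include <stdio.h>

struct dtask {
    const char *name;
    double fault_prob;   /* estimated probability of faulting if dispatched */
};

static const struct dtask *pick_next(const struct dtask *ts, int n) {
    const struct dtask *best = &ts[0];
    for (int i = 1; i < n; i++)
        if (ts[i].fault_prob < best->fault_prob)
            best = &ts[i];
    return best;
}

int main(void) {
    struct dtask ts[] = {
        { "solver-a", 0.40 },   /* working set partly paged out */
        { "solver-b", 0.05 },   /* working set resident         */
    };
    printf("dispatch %s\n", pick_next(ts, 2)->name);
    return 0;
}
```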

IV. Relevant Research and Benchmark Suites

The research references numerous works on parallel processing, distributed systems, and scheduling, including contributions from researchers such as Feitelson and Rudolph on gang scheduling and runtime identification of working sets [11, 12, 13, 14]. Benchmarking involved the NAS parallel benchmarks and PARKBENCH, providing a robust evaluation of the proposed coscheduling techniques across diverse distributed application characteristics. The study also mentions PVM [30, 31, 44] and MPI [27] as relevant parallel programming environments.

1. Benchmark Suites: NAS and PARKBENCH

The research used established benchmark suites to evaluate the proposed coscheduling techniques rigorously. The NAS Parallel Benchmarks assessed performance across a range of distributed applications; while these applications can be scaled, they do not allow tuning of specific features such as communication frequency or message length. To address this limitation, the study also developed synthetic applications with precise control over message length and communication frequency, enabling evaluation of the predictive coscheduling mechanism under varied conditions. In addition, three PARKBENCH low-level benchmarks were used, measuring performance across barriers and uni- and bi-directional communication. Together, the NAS and PARKBENCH suites provided a comprehensive assessment of the coscheduling techniques across a wide range of distributed application characteristics and communication patterns, strengthening the validity and generalizability of the findings.
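
A minimal sketch of such a synthetic application follows, with message length, message count, and inter-message delay as the tunable parameters; a pipe between two local processes stands in for the PVM transport used in the actual experiments, and all constants are assumptions for illustration.

```c
/* Synthetic communication benchmark with tunable message parameters. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define MSG_LEN   4096   /* message length in bytes (tunable)  */
#define MSG_COUNT 100    /* number of messages (tunable)       */
#define DELAY_US  1000   /* inter-message delay = 1/frequency  */

int main(void) {
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    if (fork() == 0) {                   /* child acts as the sender */
        close(fd[0]);
        char *buf = malloc(MSG_LEN);
        memset(buf, 'x', MSG_LEN);
        for (int i = 0; i < MSG_COUNT; i++) {
            write(fd[1], buf, MSG_LEN);
            usleep(DELAY_US);            /* controls message frequency */
        }
        free(buf);
        _exit(0);
    }

    close(fd[1]);                        /* parent acts as the receiver */
    char *buf = malloc(MSG_LEN);
    long want = (long)MSG_LEN * MSG_COUNT, total = 0;
    while (total < want) {
        ssize_t r = read(fd[0], buf, MSG_LEN);
        if (r <= 0) break;               /* EOF or error */
        total += r;
    }
    printf("received %ld of %ld bytes\n", total, want);
    free(buf);
    wait(NULL);
    return 0;
}
```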

2. Relevant Prior Research in Coscheduling and Parallel Processing

The research builds upon and contributes to a substantial body of existing work in coscheduling and parallel processing. Several key researchers and their contributions are cited, including Feitelson and Rudolph's work on gang scheduling and runtime identification of activity working sets [11, 12, 13, 14]. These works, along with others focusing on the impact of job memory requirements on gang scheduling performance [80, 78, 79] and coscheduling under memory constraints [81], provided a contextual framework for the current research. Further, the study mentions other notable works on scheduling techniques for concurrent systems [2], the Hector distributed runtime environment [7], the Condor system for hunting idle workstations [5], and the impact of operating system scheduling policies on parallel application performance [9]. References to PVM (Parallel Virtual Machine) [30, 31, 44] and MPI (Message Passing Interface) [27] highlight the relevant parallel programming environments considered within the context of the research. This extensive review of prior research establishes the research's position within the field and demonstrates its contribution to the advancement of coscheduling techniques.