
Parallel Job Scheduling: Theory & Practice
Document information
Author | Dror G. Feitelson |
School | The Hebrew University, University of Dortmund, University of Toronto |
Major | Computer Science |
Company | MRJ, Inc. |
Place | Jerusalem, Dortmund, Toronto, Moffett Field, CA |
Document type | Research Paper |
Language | English |
Summary
I. The Challenge of Parallel Job Scheduling in High-Performance Computing
This research paper addresses the critical issue of parallel job scheduling in high-performance computing (HPC) environments, particularly on supercomputers with tens to thousands of processors. A key challenge is the divergence between theoretical research and practical implementation. While theoretical models often rely on simplifying assumptions, real-world supercomputer scheduling systems require robust solutions for managing complex workloads with varying job characteristics. The paper explores this gap and proposes improvements to bridge the theory-practice divide, focusing on enhancing scheduling optimization and resource management.
1. The Gap Between Theory and Practice in Parallel Job Scheduling
The document begins by highlighting a significant disconnect between theoretical research and practical application in the field of parallel job scheduling on supercomputers. While theoretical studies often propose innovative scheduling algorithms, their implementation in real-world systems faces challenges due to simplifying assumptions made in the theoretical models. These assumptions, necessary for making theoretical proofs tractable, often deviate significantly from the complexities of actual supercomputing environments. The result is a gap between promising theoretical results and their practical applicability. This discrepancy makes it difficult to directly translate theoretical findings into effective solutions for real-world parallel job scheduling problems. The paper aims to address this challenge by examining the practical considerations that hinder the adoption of theoretically sound scheduling techniques, ultimately advocating for a more harmonious convergence between theoretical advancements and practical implementations.
2. Current Practices and Limitations of Parallel Job Scheduling Systems
Current parallel job scheduling systems, while incorporating some innovations like non-FCFS (First-Come, First-Served) service and swapping capabilities, largely maintain a traditional framework. This conservative approach limits flexibility in scheduler actions and how parallel programs utilize resources. Many studies demonstrate that increased flexibility, allowing schedulers to dynamically adapt to changing system loads and job characteristics, leads to better performance. The systems frequently divide resources into partitions of varying processor counts. Typically, one partition handles interactive tasks via time-slicing, another manages system services (like a parallel file system), and the remaining resources serve parallel jobs. Partition configuration might be adjusted regularly (e.g., larger partitions for parallel jobs overnight). Despite the potential for optimization, the prevailing trend is to standardize existing systems rather than radically altering them to increase flexibility and efficiency, thereby potentially limiting the full performance potential of these systems.
3. Workload Characteristics and their Impact on Scheduling Strategies
The document emphasizes the importance of understanding workload characteristics to develop effective scheduling strategies. Measurements from various supercomputer installations reveal extreme variability in job processing requirements; most jobs complete in seconds, but some run for hours. The coefficient of variation (CV) of service times, ranging from 4 to 70 across several centers, highlights this variability. This significant variation necessitates mechanisms to prevent long jobs from excessively delaying short jobs. The paper discusses how this understanding of workload characteristics influences choices in scheduling policies, algorithms and implementations. The highly variable nature of service times in real-world HPC environments necessitates innovative scheduling algorithms that can dynamically adapt to these fluctuations and minimize the negative impacts on overall system performance and user satisfaction. The paper explores various strategies, including preemption and job prioritization, to address this challenge.
4. Job Types and their Influence on Scheduler Performance
The document categorizes jobs into four types based on their flexibility: rigid, moldable, malleable, and evolving. Rigid jobs maintain a fixed processor allocation throughout their execution. Moldable jobs allow the scheduler to select the processor allocation within a given range, improving efficiency. Malleable jobs can have their processor allocation dynamically changed during execution. Evolving jobs exhibit varying parallelism levels across different phases. The application's design greatly influences its category. Applications designed to be moldable or malleable offer greater opportunities for optimization and higher efficiency. Schedulers can start moldable or malleable jobs sooner and delay the onset of saturation compared to rigid jobs, leading to significant performance advantages. Understanding these job types and tailoring scheduling strategies accordingly is crucial for optimal performance in parallel computing environments.
II. Analyzing Job Characteristics and Scheduling Metrics
The study examines various job characteristics including moldable jobs, malleable jobs, and rigid jobs, emphasizing how application design impacts scheduling performance. Key performance metrics like makespan, weighted completion time, response time, and throughput are analyzed. The paper highlights the high variability of job execution times in real-world HPC workloads and its impact on scheduling policies. The coefficient of variation (CV) of service times, often in the range of 4-70, underscores the need for mechanisms to prevent short jobs from being delayed by long jobs. Understanding these metrics is crucial for effective scheduling optimization.
1. Key Scheduling Metrics: Makespan, Weighted Completion Time, and Response Time
The paper analyzes several crucial scheduling metrics: makespan, the total time needed to complete all jobs; weighted completion time, which weights each job's completion time by its relative importance; and response time, the time from job submission to completion. The authors note that response time and flow time are often used interchangeably. These criteria trace back to the 1950s: Smith's 1956 work showed how to minimize the sum of weighted completion times on a single processor by ordering tasks by their ratio of execution time to weight (the Smith ratio), which for unit weights reduces to the well-known shortest-job-first rule. The relative ease of evaluating algorithms under these metrics contributes to their popularity, but subtle differences exist. A schedule with optimal weighted completion time also has optimal weighted flow time; this equivalence breaks down, however, as soon as schedules deviate from the optimum by even a constant factor, as demonstrated by Kellerer et al. [42] and Leonardi and Raz [47].
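To make the Smith ratio concrete, the following is a minimal sketch (with invented job data, not taken from the paper) that orders jobs on a single processor by the ratio of processing time to weight and compares the resulting sum of weighted completion times against the submission order.

```python
# Minimal illustration of Smith's rule on a single processor (hypothetical data):
# order jobs by processing_time / weight to minimize the sum of weighted
# completion times; with unit weights this reduces to shortest-job-first.
def weighted_completion_time(jobs):
    """jobs: ordered list of (processing_time, weight); returns sum of w_j * C_j."""
    clock, total = 0, 0
    for p, w in jobs:
        clock += p           # completion time C_j of this job
        total += w * clock   # accumulate weighted completion time
    return total

jobs = [(10, 1), (2, 1), (6, 3)]                      # (processing_time, weight)
smith_order = sorted(jobs, key=lambda jw: jw[0] / jw[1])
print(weighted_completion_time(jobs))                  # submission order: 76
print(weighted_completion_time(smith_order))           # Smith order: 44
```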
2. Workload Characteristics and their Impact on Scheduling Performance
Workload characteristics significantly influence scheduling performance. The paper cites studies showing extreme variability in processing requirements: most jobs complete within seconds, but a few run for many hours. The coefficient of variation (CV) of service times is consistently high, ranging from 4 to 70, across various high-performance computing centers [8, 64, 21]. This variability has direct implications for scheduling algorithms: without effective countermeasures, short jobs can be delayed indefinitely behind long ones. The paper explores approaches for addressing this challenge and improving the overall fairness and efficiency of scheduling, such as preemptive strategies and job prioritization schemes.
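As a small illustration of the statistic being quoted, the sketch below computes the coefficient of variation (standard deviation divided by mean) for a synthetic sample of service times; the numbers are invented, but they show how a handful of very long jobs pushes the CV well above 1.

```python
# Coefficient of variation (CV = standard deviation / mean) of job service times.
# The sample below is invented for illustration: many short jobs plus a few very
# long ones drive the CV well above 1, the regime reported for real workloads.
import statistics

service_times = [5, 8, 12, 20, 30, 45, 60, 90, 3600, 14400, 86400]  # seconds
mean = statistics.mean(service_times)
std = statistics.pstdev(service_times)   # population standard deviation
print(f"mean = {mean:.0f}s  std = {std:.0f}s  CV = {std / mean:.2f}")
```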
3. Job Classification: Rigid, Moldable, Malleable, and Evolving Jobs
The document introduces a classification of jobs based on their flexibility regarding processor allocation. Rigid jobs require a fixed processor allocation throughout their execution. Moldable jobs allow the scheduler to choose the processor allocation within a specified range, enhancing efficiency. Malleable jobs allow dynamic processor allocation changes during execution. Finally, evolving jobs change their parallelism requirements across different execution phases. The manner in which an application is written directly impacts its classification and the scheduler's ability to optimize its execution. The effort invested in designing applications to be moldable or malleable directly improves scheduler performance, resulting in better overall system utilization and reduced response times. The paper discusses strategies for creating moldable and malleable applications, highlighting the trade-offs between application development effort and performance gains.
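One way to picture the taxonomy is as a data structure a scheduler might keep per submission. The sketch below is purely illustrative; the type names and fields are assumptions, not drawn from the paper or any particular system.

```python
# Illustrative sketch (names and fields are assumptions, not from the paper) of
# how a scheduler might record the four job classes and the allocation freedom
# each one grants it.
from dataclasses import dataclass
from enum import Enum, auto

class JobKind(Enum):
    RIGID = auto()      # fixed processor count chosen by the user
    MOLDABLE = auto()   # scheduler picks the count once, at start, within a range
    MALLEABLE = auto()  # scheduler may change the count while the job is running
    EVOLVING = auto()   # the job itself requests changes between execution phases

@dataclass
class JobRequest:
    kind: JobKind
    min_procs: int
    max_procs: int      # equal to min_procs for rigid jobs

    def scheduler_may_resize_at_runtime(self) -> bool:
        return self.kind is JobKind.MALLEABLE

req = JobRequest(JobKind.MOLDABLE, min_procs=16, max_procs=64)
print(req.scheduler_may_resize_at_runtime())   # False: allocation is fixed at start
```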
4. Using Job Knowledge to Improve Scheduling Decisions
The paper explores leveraging job characteristics to enhance scheduling decisions. Observing repeated application runs exhibiting similar resource consumption patterns [21, 37] suggests the potential for predicting resource needs. Classifying jobs based on user ID, execution script, and requested parallelism helps reduce execution time variability within job classes [30]. This allows schedulers to make better decisions without requiring explicit user cooperation. The information gathered from past executions can effectively inform scheduling algorithms. Utilizing this information through queue selection—implicitly providing information on processor needs, execution time limits, and other parameters—is also highlighted. A scheduler can leverage this class knowledge to improve its performance, even if it is restricted to a small database of past execution characteristics. The paper indicates that response times can be significantly improved by effectively utilizing the inherent information associated with each job submission and execution history.
III. The Importance of Preemption and Job Prioritization in Parallel Job Scheduling
The paper strongly advocates for incorporating preemption in parallel job scheduling to improve overall performance. Preemption, often implemented through gang scheduling or rate-equivalent scheduling, enables better handling of short jobs and reduces fragmentation. The analysis shows that even infrequent preemption can significantly boost system utilization. Prioritizing jobs based on their likelihood of quick completion, as supported by Shortest-Job-First (SJF) principles, is highlighted as crucial for minimizing average response time. Efficient strategies for managing time slicing and space slicing are also considered. The paper emphasizes that proper implementation of preemption requires coordinated actions across all nodes running a job.
1. The Case for Preemption in Parallel Job Scheduling
The paper strongly advocates for the inclusion of preemption in parallel job scheduling despite the associated overhead. Numerous studies show that the flexibility gained from preemption significantly improves scheduling outcomes. A primary advantage is that time slicing, a common implementation of preemption, prioritizes short-running jobs, approximating the Shortest-Job-First policy. This leads to reduced average response times, especially beneficial with highly variable workloads typical of production systems. Parsons and Sevcik [64] demonstrate the importance of preemption under high variance by comparing scheduling policies with and without preemption. Furthermore, preemption supports interactive feedback by ensuring that short jobs are handled promptly. The argument for preemption is not only based on theoretical analysis but also on the observation that many existing systems utilize gang scheduling—a form of preemption—to provide good response times to short-running interactive jobs. The mechanisms for preemption involve stopping all of a job's threads in a consistent state, preserving each thread's state, and potentially removing the job's data from memory. The paper notes that preemption is also valuable in mitigating fragmentation and improving resource utilization by allowing the system to recover from suboptimal allocation decisions.
2. Gang Scheduling and Rate Equivalent Scheduling
Time slicing is often implemented using gang scheduling, where all threads of a job are scheduled and descheduled simultaneously. Feitelson and Rudolph [24] find gang scheduling superior to local scheduling. Variations, such as cyclical service among fixed partition configurations [85, 94], benefit long jobs but may increase response times for short jobs. More flexible time slicing across multiple active partition sets [23, 36] is also considered. Lee et al. [45] study the interaction between gang scheduling and I/O, highlighting that while many jobs tolerate I/O perturbations, I/O-bound jobs may suffer, suggesting flexible gang scheduling. Rate-equivalent scheduling, where all threads get equal wall-clock time but not necessarily simultaneously, is another approach that is beneficial to parallel jobs. The effectiveness of gang scheduling and rate-equivalent scheduling highlights the importance of coordinated preemption across all nodes involved in executing a job. Implementing effective coordinated preemption requires the development of appropriate system-level support to ensure consistency and avoid data corruption.
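The coordination requirement of gang scheduling can be pictured with an Ousterhout-style slot matrix: time slots are rows, processors are columns, and all threads of a job are placed in the same row so that the whole gang is dispatched and preempted together. The sketch below uses invented job sizes and a simple first-fit placement; it illustrates the idea rather than the packing scheme of any particular system.

```python
# Toy Ousterhout-style matrix for gang scheduling (job sizes and machine size are
# invented): rows are time slots, columns are processors, and every thread of a
# job sits in the same row, so the whole gang runs and is preempted together as
# the system cycles through the rows.
def build_gang_matrix(jobs, num_procs):
    """jobs: list of (name, num_threads). Returns a list of rows (time slots),
    each a list of length num_procs holding job names or None for idle slots."""
    rows = []
    for name, size in jobs:
        placed = False
        for row in rows:
            free = [i for i, slot in enumerate(row) if slot is None]
            if len(free) >= size:            # the whole gang must fit in one row
                for i in free[:size]:
                    row[i] = name
                placed = True
                break
        if not placed:                       # no room: open a new time slot
            row = [None] * num_procs
            for i in range(size):
                row[i] = name
            rows.append(row)
    return rows

for row in build_gang_matrix([("A", 4), ("B", 2), ("C", 3), ("D", 2)], num_procs=6):
    print(row)
```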
3. Job Prioritization and Reducing Response Times
The paper stresses the importance of job prioritization for improving average response times. Prioritizing jobs most likely to complete quickly, combined with preemption, significantly reduces response time. Round Robin (RR) scheduling, with its inherent preemption, offers protection against highly variable service time distributions by making average response time less sensitive to the distribution's variance (assuming negligible preemption overhead). Feedback (FB) disciplines can further reduce average response times in highly variable environments [10]. The authors argue that because users cannot predict system load at job submission, they should specify a range of acceptable processor allocations, leaving the final decision to the scheduler. Observing queue occupancy or load prediction helps inform the scheduler's choice. The overall goal is to prevent short jobs from being unduly delayed by longer jobs. The efficient use of preemption and prioritization strategies is crucial for maintaining responsiveness and ensuring fair resource allocation in a parallel processing system with high variability in job characteristics. By intelligently managing preemptions and prioritization, the scheduling system can effectively balance the needs of both short and long jobs.
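A toy round-robin simulation helps to show why preemptive time slicing protects short jobs under highly variable service times; the quantum, job lengths, and simultaneous arrivals below are simplifying assumptions made only for illustration.

```python
# Toy round-robin simulator (quantum, job lengths, and simultaneous arrivals are
# invented): with preemptive time slicing the short jobs finish early instead of
# waiting behind the 3600-second job, which is why average response time becomes
# far less sensitive to the variance of the service-time distribution.
from collections import deque

def rr_response_times(run_times, quantum):
    remaining = list(run_times)
    finish = [0] * len(run_times)
    queue = deque(range(len(run_times)))     # all jobs arrive at time 0
    clock = 0
    while queue:
        j = queue.popleft()
        slice_ = min(quantum, remaining[j])
        clock += slice_
        remaining[j] -= slice_
        if remaining[j] > 0:
            queue.append(j)                  # not done: back to the end of the queue
        else:
            finish[j] = clock                # with arrival at 0, response = finish time
    return finish

print(rr_response_times([3600, 30, 5, 120, 10], quantum=10))
```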
IV. Exploiting Job Knowledge for Improved Scheduling
The effectiveness of using prior knowledge of job characteristics for improved scheduling is explored. The study demonstrates that utilizing information about past execution times, resource demands (e.g., through queue selection), and job classes significantly reduces average response times. This includes considering the use of Least Work First (LWF) disciplines and leveraging the information obtained from execution statistics. The benefits of using this knowledge, even in the form of approximate estimations provided by users, are demonstrated. Strategies for managing these job class queue systems, including techniques like backfilling, are discussed.
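As an example of the kind of queue-management technique referred to here, the sketch below outlines EASY-style backfilling: a job further back in the queue may start ahead of the queue head only if, according to user-supplied run-time estimates, it would not delay the reservation computed for the head job. The data layout and function name are assumptions made for the sketch.

```python
# Sketch of EASY-style backfilling (data layout and names are assumptions): a job
# behind the queue head may start now only if it fits in the free processors and,
# by its user-supplied estimate, will not delay the reservation made for the head.
def pick_backfill(queue, free_procs, running, now):
    """queue: list of dicts with 'procs' and 'estimate' (seconds), in queue order.
    running: list of (expected_end_time, procs) for currently executing jobs.
    Returns the index in queue of a job that may start now, or None."""
    if not queue:
        return None
    head = queue[0]
    if head["procs"] <= free_procs:
        return 0                             # the head itself can start immediately
    # Reservation for the head: earliest time enough processors will be free.
    avail, shadow, extra = free_procs, now, 0
    for end, procs in sorted(running):
        avail += procs
        if avail >= head["procs"]:
            shadow, extra = end, avail - head["procs"]
            break
    for i, job in enumerate(queue[1:], start=1):
        fits_now = job["procs"] <= free_procs
        harmless = now + job["estimate"] <= shadow or job["procs"] <= extra
        if fits_now and harmless:
            return i                         # safe to backfill this job
    return None

waiting = [{"procs": 64, "estimate": 7200}, {"procs": 8, "estimate": 600}]
active = [(3600, 56)]                        # a running job frees 56 procs at t=3600
print(pick_backfill(waiting, free_procs=16, running=active, now=0))   # -> 1
```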
1. Leveraging Past Execution Data for Improved Scheduling
The paper explores the significant potential of using past execution data to improve scheduling decisions. Observations by Feitelson and Nitzberg [21] and Hotovy [37], among others, show that repeated runs of the same application often exhibit similar resource consumption patterns. This suggests that historical data can provide valuable insights for predicting future resource requirements and optimizing scheduling decisions. Gibbons [30] further supports this by analyzing workload data from three different sites, demonstrating that classifying jobs based on user, execution script, and requested parallelism creates job classes with significantly lower execution time variability than the overall workload. The implication is that accurate estimations of resource requirements can be achieved without directly involving users in the prediction process. This approach allows the scheduler to utilize historical data to anticipate resource needs and make more informed scheduling decisions, leading to improved overall performance and efficiency.
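A minimal sketch of this class-based estimation idea might look like the following: completed jobs are keyed by (user, script, requested processors), and the history of each class is used to predict the run time of the next submission in that class. The field names and the fallback default are assumptions for illustration.

```python
# Sketch (field names invented) of class-based run-time estimation: key each
# completed job by (user, script, requested processors) and use that class's
# history to predict the run time of the next submission in the same class.
from collections import defaultdict
from statistics import mean

class RuntimePredictor:
    def __init__(self):
        self.history = defaultdict(list)   # class key -> past run times (seconds)

    @staticmethod
    def key(user, script, procs):
        return (user, script, procs)

    def record(self, user, script, procs, runtime):
        self.history[self.key(user, script, procs)].append(runtime)

    def predict(self, user, script, procs, default=3600.0):
        past = self.history.get(self.key(user, script, procs))
        return mean(past) if past else default   # fall back when the class is new

predictor = RuntimePredictor()
predictor.record("alice", "cfd_run.sh", 64, 1250.0)
predictor.record("alice", "cfd_run.sh", 64, 1310.0)
print(predictor.predict("alice", "cfd_run.sh", 64))   # 1280.0
```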
2. Using Job Class Information and Queue Systems
Many systems assign jobs to queues based on characteristics like processor needs, execution time limits, and other parameters. Each queue, therefore, represents a job class. This information, though used implicitly in some systems through queue-partition assignments and priority settings, can be leveraged explicitly for better scheduling. The study advocates for tracking execution time and speedup knowledge on a class-by-class basis. All identifying characteristics associated with a job submission (user ID, executable file, specified memory size, etc.) can be used to assign jobs to their appropriate classes. This facilitates the estimation of job efficiency [59] or execution time [31] based on statistics from previously executed jobs in the same or similar classes. The approach of utilizing job classes via queue systems allows schedulers to implicitly incorporate information about job characteristics into their decision-making process. The potential for significant improvements in scheduling efficiency and performance by leveraging this inherent class-based information is evident.
3. Limitations of User-Supplied Estimates and the Need for Robust Policies
While users often provide information about their jobs' resource requirements (e.g., through queue selection), these estimates are not always reliable. The paper notes that user-supplied estimates, though often positively correlated with reality, are notoriously inaccurate. This underscores the need for scheduling policies that can handle both accurate and inaccurate user estimations. The document suggests that such policies should penalize users who intentionally misrepresent their job characteristics. This emphasizes the importance of designing robust scheduling policies that can effectively manage the uncertainty introduced by varying degrees of accuracy in user-provided job estimations. The development of more sophisticated mechanisms, capable of accurately estimating job characteristics even in the face of inaccurate user input, is therefore crucial. These mechanisms would ideally allow schedulers to generate more effective schedules and allocate resources more efficiently, leading to overall improvements in system performance.
4. SMART Scheduling and its Practical Applicability
The paper discusses SMART scheduling [89], an offline, non-preemptive algorithm for parallel job scheduling that aims to optimize completion time. While theoretically achieving an approximation factor of 8 [72] with a worst-case deviation of 4.5, its application to real-world job traces from the Intel Paragon at the San Diego Supercomputing Center resulted in a much lower average deviation from the optimum (around 2), further improved to 1.4 using list scheduling [33]. However, the algorithm's offline nature and reliance on prior knowledge of all job execution times limit its practical applicability in dynamic environments. The discussion of SMART scheduling illustrates the complexities involved in translating theoretical scheduling algorithms into practically usable tools in real-world supercomputing settings. The limitations of this particular algorithm highlight the challenges in balancing theoretical optimality with the requirements of practical system implementation.
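For context, the sketch below shows generic greedy list scheduling for rigid parallel jobs, the kind of scheme the reported improvement refers to; it is not the SMART algorithm itself, and the job data are invented.

```python
# Minimal list-scheduling sketch for rigid parallel jobs (illustrative only; this
# is the generic greedy scheme, not the SMART algorithm discussed in the paper):
# scan the ordered list and, whenever processors free up, start the first job
# that fits into the processors currently available.
import heapq

def list_schedule(jobs, total_procs):
    """jobs: ordered list of (name, procs, runtime). Returns {name: start_time}."""
    free = total_procs
    clock = 0.0
    ending = []                              # min-heap of (end_time, procs)
    starts = {}
    pending = list(jobs)
    while pending:
        started = None
        for i, (name, procs, runtime) in enumerate(pending):
            if procs <= free:                # first job in list order that fits
                starts[name] = clock
                heapq.heappush(ending, (clock + runtime, procs))
                free -= procs
                started = i
                break
        if started is not None:
            pending.pop(started)
        else:                                # nothing fits: advance to next completion
            end, procs = heapq.heappop(ending)
            clock, free = end, free + procs
    return starts

print(list_schedule([("A", 8, 100.0), ("B", 4, 20.0), ("C", 6, 50.0)], total_procs=10))
```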
V. System Design and Standardization for Parallel Job Scheduling: The PSCHED API
Addressing the need for standardization, the paper introduces the PSCHED API, designed to facilitate interoperability between different components of a scheduling system. This API aims to standardize interfaces for schedulers, task managers, and resource managers without dictating specific implementations. The goal is to enable schedulers to manage various parallel jobs (MPI-2, PVM, SMP multi-tasking jobs) across diverse machine architectures. This initiative builds upon existing systems like LoadLeveler and NQS (Network Queueing System), and the POSIX Batch Environment amendment of 1987. The NAS (Numerical Aerospace Simulation facility) plays a key role in the development of this API. The initiative highlights the importance of well-defined APIs for managing the complexities of modern high-performance computing (HPC) environments.
1. The Need for Standardization in Parallel Job Scheduling
The document highlights the lack of standardization in parallel job scheduling systems as a significant obstacle to progress. While some innovations like non-FCFS service and swapping have been introduced, the general trend is toward maintaining existing frameworks and formalizing them as standards. This approach, while seemingly promoting stability and compatibility, often limits flexibility, hindering the development and implementation of more efficient scheduling algorithms. Many studies show that enhanced flexibility in both scheduler actions and program utilization of parallelism significantly improves performance. The current state, with various systems operating under different approaches and interfaces, makes it difficult to compare and contrast different scheduling techniques effectively. A consistent standard would also aid in the development of more sophisticated and adaptable scheduling systems.
2. Introduction of the PSCHED API
The paper proposes the PSCHED API as a solution to the lack of standardization. Developed by an informal group involving NAS (Numerical Aerospace Simulation facility), several NASA centers, IBM, Pratt & Whitney, Platform Computing, and others, the API aims to provide standard interfaces among the components of a job and resource management system, not to standardize implementations. This allows for greater interoperability between different modules like the scheduler, task manager, and resource manager. The goal is to create a minimal set of standard functions for each module that enables scheduling a variety of parallel jobs—MPI-2, PVM, and SMP multi-tasking jobs—on diverse machines. The PSCHED API is designed to address the practical need for standardization while also acknowledging the diversity of existing parallel job scheduling systems and application designs. Its emphasis on interface standardization rather than implementation details promotes flexibility and fosters innovation while also providing a more organized and unified approach to parallel job scheduling.
3. Historical Context: NQS and the POSIX Batch Environment
The development of the PSCHED API is placed within a historical context, referencing the evolution of job scheduling standards. The Network Queueing System (NQS) gained strong acceptance among hardware providers and customers, leading to its selection as the basis for the POSIX Batch Environment amendment in 1987. This highlights that standards are influenced by industry adoption. The amendment, officially approved in 1994 as IEEE POSIX 1003.2d, initially postponed addressing issues like programmatic interfaces and resource control, resulting in the inactivity of the supercomputing working group thereafter. The PSCHED API project is presented as an effort to address the issues left unaddressed by the POSIX Batch Environment amendment. The historical context demonstrates both the need for standards and the complexities involved in creating and implementing such standards in the face of evolving technology and competing interests within the HPC field. The PSCHED API seeks to build upon the successes of previous standardization efforts while also avoiding the pitfalls that led to limitations in those previous initiatives.
VI. Conclusion and Recommendations for Optimal Parallel Job Scheduling
The paper concludes by reiterating the importance of preemption, effective job prioritization, and leveraging job knowledge for better scheduling optimization. Key recommendations include providing system support for coordinated preemption (e.g., through gang scheduling), prioritizing jobs likely to complete quickly, and utilizing information about job characteristics. The need for flexible processor allocation strategies, allowing schedulers to adapt to varying system loads, is also emphasized. The development of standard APIs like PSCHED is deemed essential for fostering interoperability and facilitating advancements in the field of parallel job scheduling.
1. Summary of Findings and the Theory-Practice Gap
The paper concludes by reflecting on the relationship between theoretical research and practical implementation in parallel job scheduling. It acknowledges that sometimes theory leads practice, suggesting novel approaches, while at other times it lags behind, offering justifications for already established practices. The field of parallel job scheduling currently occupies a space where the exact role of theory remains unclear. The paper's analysis suggests that there is a substantial gap between the theoretical potential of advanced scheduling algorithms and their practical implementation in real-world supercomputing environments. This gap arises from the simplifying assumptions often necessary to make theoretical proofs tractable. Bridging this gap requires a focus on creating more robust and adaptable scheduling solutions that can effectively manage the complexities and variability inherent in practical HPC workloads.
2. Recommendations for System Design and Scheduling Policies
The paper offers several key recommendations for improving parallel job scheduling. Firstly, it strongly advocates for system support for parallel job preemption, emphasizing its importance for achieving optimal performance. This should include mechanisms for coordinating preemption across nodes, such as gang scheduling or rate-equivalent scheduling. Secondly, prioritizing jobs most likely to complete quickly is recommended for enhanced average response time, employing preemption when needed. The authors highlight that round robin (RR) scheduling, with its built-in preemption, helps manage highly variable service time distributions. Thirdly, making use of available information about job characteristics—whether directly provided, measured, or remembered—is emphasized. User-supplied information, even if not entirely accurate, is valuable for enhancing scheduling decisions. Fourthly, the paper argues for flexible processor allocation, allowing the scheduler to adjust to varying system loads. This involves users providing acceptable processor ranges, leaving the precise allocation to the scheduler's discretion. The recommendations are presented as actionable steps toward building more efficient and responsive parallel job scheduling systems.
3. The Role of the PSCHED API in Standardization
The paper concludes by reinforcing the importance of the PSCHED API for standardization in parallel job scheduling. The API is presented as a critical step towards addressing the fragmentation and lack of interoperability amongst different parallel job scheduling systems. By focusing on interface standardization, rather than dictating specific implementations, PSCHED allows for flexibility and innovation while promoting consistency across different systems and applications. The aim is to create a common framework that can support the diverse needs of modern parallel computing environments. This is discussed in the context of the POSIX Batch Environment and the NQS system, highlighting the ongoing need for improved standards that promote interoperability and efficiency in managing complex parallel workloads. The PSCHED API emerges as a significant contribution towards achieving a more unified and effective approach to parallel job scheduling in the high-performance computing landscape.