Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Thamsen, Lauritz; Verbitskiy, Ilya; Nedelkoski, Sasho; Tran, Vinh Thuy; Meyer, Vinícius; Xavier, Miguel G.; Kao, Odej; Rose, César A. F. De

doi:10.1007/978-3-030-48340-1_40

Cited by 4 publications

(5 citation statements)

References 20 publications


“…In the following we describe the method of grouping jobs, learning good co‐location of groups, and the evaluation of the scheduling decisions by Hugo. More details about the calculation of the scheduling probabilities as well as the prototype implementation can be found in the previous publication about Hugo 3 …”

Section: Cluster Scheduling Methods and Experiments (mentioning)

confidence: 99%

“…We acknowledge the co‐authors of our previous publications on this topic, 1‐3 especially Benjamin Rabier, Ilya Verbitskiy, and Florian Schmidt. This study was funded by German Ministry for Education and Research (BMBF) as BBDC (01IS14013A and 01IS18025A).…”

Section: Acknowledgements (mentioning)

confidence: 99%

“…This is an extended discussion of the works we published at the Big Data Congress 2017 1 (https://doi.org/10.1109/BigDataCongress.2017.28, © IEEE, 2017), in the STBD journal 4(1) 2 (https://doi.org/10.29268/stbd.2017.4.1.3), and the ParaMo workshop at Euro‐Par 2019 3 (to appear, © Springer, 2019), presenting all our scheduler variants together in one article for the first time. In comparison to our previous publications, we also added a new motivation and a comparison of the related work.…”

mentioning

confidence: 99%


Mary, Hugo, and Hugo*: Learning to schedule distributed data‐parallel processing jobs on shared clusters

Thamsen, Beilharz, Tran, et al. 2020

Concurrency and Computation

Self Cite


Summary: Distributed data‐parallel processing systems like MapReduce, Spark, and Flink are popular for analyzing large datasets using cluster resources. Resource management systems like YARN or Mesos in turn allow multiple data‐parallel processing jobs to share cluster resources in temporary containers. Often, the containers do not isolate resource usage to achieve high degrees of overall resource utilization despite overprovisioning and the often fluctuating utilization of specific jobs. However, some combinations of jobs utilize resources better and interfere less with each other when running on the same shared nodes than others. This article presents an approach for improving the resource utilization and job throughput when scheduling recurring distributed data‐parallel processing jobs in shared clusters. The approach is based on reinforcement learning and a measure of co‐location goodness to have cluster schedulers learn over time which jobs are best executed together on shared resources. We evaluated this approach over the last years with three prototype schedulers that build on each other: Mary, Hugo, and Hugo*. For the evaluation we used exemplary Flink and Spark jobs from different application domains and clusters of commodity nodes managed by YARN. The results of these experiments show that our approach can increase resource utilization and job throughput significantly.
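The idea in this summary, learning over time which jobs run well together from a co‐location goodness signal, can be illustrated with a small sketch. This is not the authors' algorithm: it swaps in a textbook gradient‐bandit update, and the class name, the job names, and the assumption that goodness is a number in [0, 1] are all hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict


class CoLocationLearner:
    """Toy learner (hypothetical, not Hugo): one preference per job pair."""

    def __init__(self, step_size=0.1):
        self.step_size = step_size
        # (running job, candidate job) -> learned preference, starts at 0.0
        self.preference = defaultdict(float)
        self.baseline = 0.0      # running mean of goodness values seen so far
        self.observations = 0

    def probabilities(self, running, candidates):
        """Softmax over the preferences of all candidates for a running job."""
        weights = [math.exp(self.preference[(running, c)]) for c in candidates]
        total = sum(weights)
        return [w / total for w in weights]

    def select(self, running, candidates, rng=random):
        """Sample the next job to co-locate according to the probabilities."""
        probs = self.probabilities(running, candidates)
        return rng.choices(candidates, weights=probs, k=1)[0]

    def update(self, running, candidates, chosen, goodness):
        """Gradient-bandit update from one observed co-location.

        `goodness` is assumed to be a score in [0, 1]; how it is measured
        is left open here (the source only names such a measure).
        """
        probs = self.probabilities(running, candidates)
        advantage = goodness - self.baseline
        for candidate, p in zip(candidates, probs):
            indicator = 1.0 if candidate == chosen else 0.0
            self.preference[(running, candidate)] += (
                self.step_size * advantage * (indicator - p)
            )
        # Update the running mean after using the old baseline above.
        self.observations += 1
        self.baseline += (goodness - self.baseline) / self.observations
```

After a co‐location with above‐average goodness, the chosen candidate's selection probability rises and the others fall, so a scheduler built this way would drift toward combinations that have worked well for recurring jobs.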


“…As mentioned in previous sections, ASA X is a stateful extension of [36], where the main difference is that ASA X can incorporate previous decisions in a RL approach. In [39], the authors combine offline job classification with online RL to improve collocations. This approach can accelerate convergence, although it might have complex consequences when new unknown jobs do not fit into the initial classification.…”

Section: Related Work (mentioning)

confidence: 99%

A HPC Co-scheduler with Reinforcement Learning

Souza, Pelckmans, Tordsson 2021

Job Scheduling Strategies for Parallel Processing


Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales and heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization with long job completion times (makespans). To tackle these challenges, we propose a co-scheduling algorithm based on an adaptive reinforcement learning algorithm, where application profiling is combined with cluster monitoring. The resulting cluster scheduler matches resource utilization to application performance in a fine-grained manner (i.e., operating system level). As opposed to nominal allocations, we apply decision trees to model applications' actual resource usage, which are used to estimate how much resource capacity from one allocation can be co-allocated to additional applications. Our algorithm learns from incorrect co-scheduling decisions and adapts from changing environment conditions, and evaluates when such changes cause resource contention that impacts quality of service metrics such as jobs slowdowns. We integrate our algorithm in an HPC resource manager that combines Slurm and Mesos for job scheduling and co-allocation, respectively. Our experimental evaluation performed in a dedicated cluster executing a mix of four real different scientific workflows demonstrates improvements on cluster utilization of up to 51% even in high load scenarios, with 55% average queue makespan reductions under low loads.


“…In cloud computing ecosystems, consolidating multiple user applications onto multi-core servers generates interference between co-hosted applications, which impacts application performance. To minimize interference effects and improve application performance, a common solution is to utilize schedulers that consider interference issues [26].…”

Section: Interference-aware Scheduling (mentioning)

confidence: 99%

ML-driven classification scheme for dynamic interference-aware resource scheduling in cloud infrastructures

Meyer, Kirchoff, Silva, et al. 2021

Journal of Systems Architecture
