2020
DOI: 10.1007/s11227-020-03506-5

OKCM: improving parallel task scheduling in high-performance computing systems using online learning

Cited by 21 publications (9 citation statements)
References 32 publications
“…In [16], the authors propose GARLSched, which uses reinforcement learning to take multiple pieces of task information into account and can be optimized for different workloads. In [17], the authors propose an efficient running-time prediction model, OKCM, an online-learning and KNN-based predictor with a correction mechanism. In [18], the authors propose RLSchert, a job scheduler based on deep reinforcement learning and remaining-runtime prediction; it estimates the state of the system with a dynamic job remaining-runtime predictor and learns the best policy for selecting or killing jobs from that state via imitation learning and approximate policy optimization algorithms.…”
Section: Related Work
confidence: 99%
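As a rough illustration of the kind of predictor described in [17] (not the paper's actual OKCM implementation), the sketch below shows a k-nearest-neighbour runtime estimator over historical jobs with a simple multiplicative correction factor; the feature choice, correction rule, and class name are assumptions made here for clarity.

```python
# Illustrative sketch only (not OKCM's code): a k-nearest-neighbour runtime
# predictor over historical jobs, with a hypothetical multiplicative correction
# that is nudged toward recent prediction errors as jobs finish (online update).
import numpy as np

class KNNRuntimePredictor:
    def __init__(self, k=5):
        self.k = k
        self.features = None   # historical job features (e.g. cores, requested walltime)
        self.runtimes = None   # observed runtimes of those jobs
        self.correction = 1.0  # running multiplicative correction factor

    def fit(self, features, runtimes):
        self.features = np.asarray(features, dtype=float)
        self.runtimes = np.asarray(runtimes, dtype=float)

    def predict(self, job):
        # Euclidean distance to every historical job; average the k closest runtimes.
        d = np.linalg.norm(self.features - np.asarray(job, dtype=float), axis=1)
        nearest = np.argsort(d)[: self.k]
        return float(self.runtimes[nearest].mean()) * self.correction

    def update(self, job, actual_runtime):
        # Online step: blend the correction factor toward the observed error,
        # then append the finished job to the history.
        predicted = self.predict(job)
        self.correction = 0.9 * self.correction + 0.1 * (actual_runtime / max(predicted, 1e-9))
        self.features = np.vstack([self.features, job])
        self.runtimes = np.append(self.runtimes, actual_runtime)


# Usage: three historical jobs described by (requested cores, requested walltime).
pred = KNNRuntimePredictor(k=2)
pred.fit([[8, 3600], [16, 7200], [8, 1800]], [3000, 6500, 1500])
print(pred.predict([8, 3600]))   # estimate for a new 8-core, 1-hour request
pred.update([8, 3600], 2800)     # refine the model once the job finishes
```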
“…However, neglecting the state of computing resources can lead to unbalanced data placement and task allocation in wide-area environments and reduce computing efficiency. In recent years, there have been several efforts to perform task rescheduling or data redistribution through machine learning and heuristic algorithms, but again, they do not consider the relationship between tasks and data [16][17][18][19]. To summarize, the aforementioned methods mainly optimize either task rescheduling or data redistribution in isolation rather than taking a comprehensive approach, and therefore cannot meet the demands of global performance optimization.…”
Section: Introduction
confidence: 99%
“…In HPC, job scheduling has been a long-standing research topic [2][3][4][5][6][7][8][9][10][11][12][29]. Maximizing resource utilization, reducing resource fragmentation, and improving user satisfaction have always been the goals of researchers.…”
Section: Related Work
confidence: 99%
“…Jobs are classified into long and short jobs, and short jobs are executed first [30]. In addition, there are optimized backfilling algorithms that combine job runtime prediction with EASY backfilling [31] or dynamically adjust the estimated job completion time during the simulation process [11,12].…”
Section: Related Work
confidence: 99%
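To make the backfilling idea concrete, here is a minimal, hypothetical sketch (not the code of [30], [31], or [11,12]): an EASY-style pass that reserves cores for the job at the head of the queue and backfills later jobs only if they fit in the currently free cores and are predicted to finish before that reservation. The reservation rule, data types, and function name are simplifications assumed here.

```python
# Minimal EASY-style backfilling sketch driven by predicted runtimes
# (illustrative assumptions, not any cited scheduler's implementation).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    predicted_runtime: float  # e.g. from a KNN predictor as sketched above

def easy_backfill(queue, free_cores, now, running_end_times):
    """Return the jobs to start now, in queue order plus backfilled jobs."""
    started = []
    queue = list(queue)
    while queue:
        head = queue[0]
        if head.cores <= free_cores:
            started.append(queue.pop(0))
            free_cores -= head.cores
            continue
        # Head cannot start: reserve it at a simplified earliest-start estimate
        # (here: the latest finish time among running jobs), then backfill.
        reservation = max(running_end_times, default=now)
        for job in queue[1:]:
            fits = job.cores <= free_cores
            finishes_in_time = now + job.predicted_runtime <= reservation
            if fits and finishes_in_time:
                started.append(job)
                free_cores -= job.cores
                queue.remove(job)
        break
    return started

# Example: a wide head job waits for cores, so a short narrow job is backfilled.
q = [Job("A", cores=64, predicted_runtime=7200),
     Job("B", cores=8, predicted_runtime=600),
     Job("C", cores=32, predicted_runtime=5400)]
print([j.name for j in easy_backfill(q, free_cores=16, now=0.0, running_end_times=[3600.0])])
# -> ['B']
```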