2020
DOI: 10.1007/s11227-020-03506-5

OKCM: improving parallel task scheduling in high-performance computing systems using online learning

Cited by 21 publications (9 citation statements)
References 32 publications
“…In [16], the authors propose GARLSched, which uses reinforcement learning to take multiple pieces of task information into account and can be optimized for different workloads. In [17], the authors propose an efficient running-time prediction model, OKCM, an online-learning and KNN-based predictor with a correction mechanism. In [18], the authors propose RLSchert, a job scheduler based on deep reinforcement learning and remaining-runtime prediction; it estimates the state of the system with a dynamic job remaining-runtime predictor and learns the best policy for selecting or killing jobs from that state via imitation learning and approximate policy optimization algorithms.…”
Section: Related Work
confidence: 99%
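As a rough illustration of the kind of predictor described in [17] (not the paper's actual OKCM implementation), the sketch below shows a k-nearest-neighbour runtime estimator over historical jobs with a simple multiplicative correction factor; the feature choice, correction rule, and class name are assumptions made here for clarity.

```python
# Illustrative sketch only (not OKCM's code): a k-nearest-neighbour runtime
# predictor over historical jobs, with a hypothetical multiplicative correction
# that is nudged toward recent prediction errors as jobs finish (online update).
import numpy as np

class KNNRuntimePredictor:
    def __init__(self, k=5):
        self.k = k
        self.features = None   # historical job features (e.g. cores, requested walltime)
        self.runtimes = None   # observed runtimes of those jobs
        self.correction = 1.0  # running multiplicative correction factor

    def fit(self, features, runtimes):
        self.features = np.asarray(features, dtype=float)
        self.runtimes = np.asarray(runtimes, dtype=float)

    def predict(self, job):
        # Euclidean distance to every historical job; average the k closest runtimes.
        d = np.linalg.norm(self.features - np.asarray(job, dtype=float), axis=1)
        nearest = np.argsort(d)[: self.k]
        return float(self.runtimes[nearest].mean()) * self.correction

    def update(self, job, actual_runtime):
        # Online step: blend the correction factor toward the observed error,
        # then append the finished job to the history.
        predicted = self.predict(job)
        self.correction = 0.9 * self.correction + 0.1 * (actual_runtime / max(predicted, 1e-9))
        self.features = np.vstack([self.features, job])
        self.runtimes = np.append(self.runtimes, actual_runtime)


# Usage: three historical jobs described by (requested cores, requested walltime).
pred = KNNRuntimePredictor(k=2)
pred.fit([[8, 3600], [16, 7200], [8, 1800]], [3000, 6500, 1500])
print(pred.predict([8, 3600]))   # estimate for a new 8-core, 1-hour request
pred.update([8, 3600], 2800)     # refine the model once the job finishes
```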
“…However, neglecting the state of computing resources can lead to unbalanced data placement and task allocation in wide-area environments and reduce computing efficiency. In recent years, there have been several efforts to perform task rescheduling or data redistribution through machine learning and heuristic algorithms, but again, they do not consider the relationship between tasks and data [16][17][18][19]. To summarize, the aforementioned methods mainly optimize either task rescheduling or data redistribution in isolation rather than taking a comprehensive approach, and therefore cannot meet the demands of global performance optimization.…”
Section: Introduction
confidence: 99%
“…In HPC, job scheduling has been a long-standing research topic [2][3][4][5][6][7][8][9][10][11][12][29]. Maximizing resource utilization, reducing resource fragmentation, and improving user satisfaction have always been the goals of researchers.…”
Section: Related Work
confidence: 99%
“…Jobs are classified into long and short jobs, and short jobs are executed first [30]. In addition, there are optimized backfilling algorithms that combine job runtime prediction with EASY backfilling [31] or dynamically adjust the estimated job completion time during the simulation process [11,12].…”
Section: Related Work
confidence: 99%
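To make the backfilling idea concrete, here is a minimal, hypothetical sketch (not the code of [30], [31], or [11,12]): an EASY-style pass that reserves cores for the job at the head of the queue and backfills later jobs only if they fit in the currently free cores and are predicted to finish before that reservation. The reservation rule, data types, and function name are simplifications assumed here.

```python
# Minimal EASY-style backfilling sketch driven by predicted runtimes
# (illustrative assumptions, not any cited scheduler's implementation).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    predicted_runtime: float  # e.g. from a KNN predictor as sketched above

def easy_backfill(queue, free_cores, now, running_end_times):
    """Return the jobs to start now, in queue order plus backfilled jobs."""
    started = []
    queue = list(queue)
    while queue:
        head = queue[0]
        if head.cores <= free_cores:
            started.append(queue.pop(0))
            free_cores -= head.cores
            continue
        # Head cannot start: reserve it at a simplified earliest-start estimate
        # (here: the latest finish time among running jobs), then backfill.
        reservation = max(running_end_times, default=now)
        for job in queue[1:]:
            fits = job.cores <= free_cores
            finishes_in_time = now + job.predicted_runtime <= reservation
            if fits and finishes_in_time:
                started.append(job)
                free_cores -= job.cores
                queue.remove(job)
        break
    return started

# Example: a wide head job waits for cores, so a short narrow job is backfilled.
q = [Job("A", cores=64, predicted_runtime=7200),
     Job("B", cores=8, predicted_runtime=600),
     Job("C", cores=32, predicted_runtime=5400)]
print([j.name for j in easy_backfill(q, free_cores=16, now=0.0, running_end_times=[3600.0])])
# -> ['B']
```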