Machine learning based job status prediction in scientific clusters

Yoo, Wucherl; Sim, Alex; Wu, Kesheng

doi:10.1109/sai.2016.7555961

Cited by 14 publications

(9 citation statements)

References 21 publications

(24 reference statements)

“…Papers [2][3][4][5] demonstrate related research, such as job status prediction, failure prediction and anomaly detection, based on log file analysis with machine learning with good results. Whether abnormal detection or job status prediction, the number of correct instances (majority class) should be much more than the number of incorrect instances (minority class) in a dataset, which leads to an imbalanced dataset just like our dataset presented here.…”

Section: Discussion (mentioning)

confidence: 99%

“…Klinkenberg et al [2] proposed and evaluated a method for predicting failures with framed cluster monitoring data and extracted features describing the characteristic of the signals. Authors in [3] presented a machine learning based Random forests (RF) classification model for predicting unsuccessful job executions. In modern supercomputing centers, successful or health jobs occupy a very large part of job databases.…”

Section: Related Work (mentioning)

confidence: 99%

“…As they are normalized by time-based value, those normalized performance related measurements serve as appropriate data sources for machine learning based job status prediction. Indeed, this can also be extended to online prediction [3]. In total, 14.3 million jobs were recorded, with a total database size of 8.5 GiB.…”

Section: Data Collection and Feature Engineering (mentioning)

confidence: 99%

“…In most HPC systems, there are a huge number of jobs submitted by thousands of users who are potentially grouped into hundreds of user groups. In relevant research about job logs analysis, researchers usually divide logs into subsets with different rules or purposes for seeking hidden patterns from those logs [1][2][3].…”

Section: Classification With Subset Dataset Categorized By Scientific (mentioning)

confidence: 99%

Machine Learning Predictions for Underestimation of Job Runtime on HPC System

Guo, Nomura, Barton et al. 2018

Lecture Notes in Computer Science

In modern high-performance computing (HPC) systems, users are usually requested to estimate the job runtime for system scheduling when they submit a job. In general, an underestimation of job runtime will cause the HPC system to terminate the job before its completion. If users could be notified that their jobs may not finish before their allocated time expires, they could take actions, such as killing the job and resubmitting it after parameter adjustment, to save time and cost. Meanwhile, the productivity of HPC systems could also be vastly improved. In this paper, we propose a data-driven approach (that is, one that actively observes, analyzes, and logs jobs) for predicting underestimation of job runtime on HPC systems. Using data produced by TSUBAME 2.5, a supercomputer deployed at the Tokyo Institute of Technology, we apply machine learning algorithms to recognize patterns about whether the underestimation of job runtime occurs. Our experimental results show that our approach achieves 80% precision, 70% recall and 74% F1-score for runtime-underestimation prediction on the entirety of a given dataset. Finally, we split the entire job data set into subsets categorized by scientific application name. The best precision, recall and F1-score of subsets on runtime-underestimation prediction reached 90%, 95% and 92% respectively.
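As a reading aid for the figures in this abstract, the sketch below shows how an F1-score relates to precision and recall, assuming the standard definition (the harmonic mean of the two). The function name `f1_score` is illustrative and not taken from the paper. The abstract does not say how its scores were rounded or averaged, so the computed values need not match the reported ones exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # conventional value when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the abstract above.
whole_dataset = f1_score(0.80, 0.70)  # entire dataset
best_subset = f1_score(0.90, 0.95)    # best per-application subset

print(f"whole dataset: {whole_dataset:.3f}")
print(f"best subset:   {best_subset:.3f}")
```

Under this assumption the whole-dataset value comes out near 0.75 and the best-subset value near 0.92, which is in line with the reported 74% and 92% once rounding or averaging differences are allowed for.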

“…To grasp the complex relation among system components and to discover early symptoms of failures to come, several works rely on ML techniques, e.g., [10, 71, 72, 73, 74, 75]. Abu-Samah et al [74] rely on Bayesian networks; as a drawback, this approach requires to be complemented with the extraction and validation of system patterns, which may involve expert opinions or elicitations on several levels.…”

Section: Related Work (mentioning)

confidence: 99%

Event-based failure prediction in distributed business processes

Borkowski, Fdhila, Nardelli et al. 2019

Information Systems

Traditionally, research in Business Process Management has put a strong focus on centralized and intra-organizational processes. However, today's business processes are increasingly distributed, deviating from a centralized layout, and therefore call for novel methodologies for detecting and responding to unforeseen events, such as errors occurring during process runtime. In this article, we demonstrate how to employ event-based failure prediction in business processes. This approach makes it possible to use the best of both traditional Business Process Management Systems and event-based systems. Our approach employs machine learning techniques and considers various types of events. We evaluate our solution using two business process data sets, including one from a real-world event log, and show that we are able to detect errors and predict failures with high accuracy.

FP‐JSC: Job failure prediction on supercomputers through job application sequence correlation

Xian, Yang 2024

Concurrency and Computation

Summary: Supercomputers are advanced computing systems interconnected through high‐speed communication networks, consisting of independent computational nodes. During the unfolding of the big data era, the potent computational capabilities of these supercomputers play a pivotal role in scientific computing. Despite executing numerous advanced computational science and engineering tasks on supercomputers, many submitted jobs fail due to various factors, resulting in user inefficiencies. These failures not only consume system resources but also reduce the overall efficiency of the system. Previous research often couples job performance features with a single machine learning method for predicting job failure. However, a primary hurdle emerges from the high cost of gathering these features, complicating their real‐world applicability. To address this challenge, our study establishes correlations among job applications through extensive job log analysis. Leveraging these correlations, we propose a predictive framework based on job application sequence correlation (called FP‐JSC). This innovative framework employs multiple machine learning models to offer holistic predictions, selecting the most suitable model based on its learning effectiveness. Moreover, the framework optimizes feature collection expenses without adversely affecting job execution. We determine job applications using both job paths and job names, with the former emerging as a novel feature derived from supplementary monitoring data. Empirical results underscore FP‐JSC's effectiveness, accurately identifying over 89% of jobs with 95% specificity and 89% sensitivity, outperforming single prediction methods employed in related works.
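For readers less familiar with the two metrics quoted at the end of this summary, the following sketch computes sensitivity and specificity from confusion-matrix counts using their standard definitions. The counts are hypothetical, chosen only so the output matches the quoted percentages; they are not data from the paper, and the function names are illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of actual negatives correctly passed over."""
    return tn / (tn + fp)


# Hypothetical counts (NOT from the paper): 100 failed jobs, 100 successful jobs.
tp, fn = 89, 11  # failed jobs predicted as failed / missed
tn, fp = 95, 5   # successful jobs predicted as successful / falsely flagged

print(f"sensitivity: {sensitivity(tp, fn):.2f}")
print(f"specificity: {specificity(tn, fp):.2f}")
```

Reporting both rates matters for failure prediction because failed jobs are usually the minority class, so overall accuracy alone can look high even when most failures are missed.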
