Predicting job completion times using system logs in supercomputing clusters

Chen, Xin; Lü, Chao; Pattabiraman, Karthik

doi:10.1109/dsnw.2013.6615513

Cited by 27 publications

(10 citation statements)

References 11 publications

“…Prior research activities have centered on analyzing error logs [1][2][3][4][5][6] as well as some online analysis for patterns preceding a failure, and evaluated the accuracy and efficacy of anomaly detection and proactive response [12,13]. They have addressed one or more of the following issues: basic error characteristics [1,2,5], modeling and evaluation [6,14,15], failure prediction and proactive checkpointing [16,17]. There are many challenges in systematically studying large-scale systems using operational data, such as data availability, data collection/mining and fault/failure characterization.…”

Section: Related Work (mentioning)

confidence: 99%

LogDiver

Martino, Jha, Kramer et al. 2015

Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale

This paper presents LogDiver, a tool for the analysis of application-level resiliency in extreme-scale computing systems. The tool has been implemented to handle data generated by system monitoring tools in Blue Waters, the petascale machine in production at the University of Illinois' National Center for Supercomputing Applications. The tool is able: i) to filter, extract, and classify error data from different sources of information, such as system logs, hardware sensors and workload logs; ii) to extract signals from the categorized errors; iii) to consolidate user application data and decode application and job exit status, highlighting the reasons for the application/job exit; and iv) to correlate application failures with errors using a mix of empirical and analytical techniques. To the best of our knowledge, this is the first tool capable of measuring application-level resiliency in extreme-scale machines. We also demonstrate the power of the tool by showing that XK applications are more vulnerable to failures when compared to XE applications.

“…Besides memory usage prediction, a lot of other researchers focus on predicting more other job related metrics, such as job runtimes [6,7,10,11], job queue-waiting time [12,13], job start times [20], job completion time [14], power usage [16,17], etc. Instead of using job submission records, some of the work are directly using the recent previous runs of the same applications to calculate the runtimes for next jobs, which limited to certain suitable scenarios.…”

Section: Related Work (mentioning)

confidence: 99%

Practical Resource Usage Prediction Method for Large Memory Jobs in HPC Clusters

et al. 2019

Supercomputing Frontiers

Users in high performance computing (HPC) clusters normally face challenges to specify accurate resource estimates for running their applications as batch jobs. Prediction is a common way to alleviate this complexity by using historical job records of previous runs to estimate resource usage for new coming jobs. Most of existing resource prediction methods directly build a single model to consider all of the jobs in clusters. However, people in production usage tend to only focus on the resource usage of jobs with certain patterns, e.g. jobs with large memory consumption. This paper proposes a practical resource prediction method for large memory jobs. The proposed method first tries to predict whether a job tends to use large memory size, and then predicts the final memory usage using a model which is trained by only historical large memory jobs. Using several real-world job traces collected from large production clusters of IBM Spectrum LSF customer sites, the evaluation results show that the average prediction errors can be reduced up to 40% for nearly 90% of large memory jobs. Meanwhile, the model training cost can be reduced over 30% for the evaluated job traces.

“…Huge volume of log data are being collected from distributed systems that are widely used in critical application domains. Such large-scale log data have been used for program verification [15][16][17], performance monitoring [18], failure analysis [19,20], and security audits [21,22], as well as for detecting anomalies that occur in these systems [1,4,5]. There are several studies on log parsing, which is a crucial step of log analysis.…”

Section: Related Work (mentioning)

confidence: 99%

DILAF: A framework for distributed analysis of large‐scale system logs for anomaly detection

2018

Summary: System logs constitute a rich source of information for detection and prediction of anomalies. However, they can include a huge volume of data, which is usually unstructured or semistructured. We introduce DILAF, a framework for distributed analysis of large‐scale system logs for anomaly detection. DILAF is comprised of several processes to facilitate log parsing, feature extraction, and machine learning activities. It has two distinguishing features with respect to the existing tools. First, it does not require the availability of source code of the analyzed system. Second, it is designed to perform all the processes in a distributed manner to support scalable analysis in the context of large‐scale distributed systems. We discuss the software architecture of DILAF and we introduce an implementation of it. We conducted controlled experiments based on two datasets to evaluate the effectiveness of the framework. In particular, we evaluated the performance and scalability attributes under various degrees of parallelism. Results showed that DILAF can maintain the same accuracy levels while achieving more than 30% performance improvement on average as the system scales, compared to baseline approaches that do not employ fully distributed processing.
