2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W) 2013
DOI: 10.1109/dsnw.2013.6615513

Predicting job completion times using system logs in supercomputing clusters

citations
Cited by 27 publications
(10 citation statements)
references
References 11 publications
“…Prior research activities have centered on analyzing error logs [1][2][3][4][5][6] as well as some online analysis for patterns preceding a failure, and evaluated the accuracy and efficacy of anomaly detection and proactive response [12,13]. They have addressed one or more of the following issues: basic error characteristics [1,2,5], modeling and evaluation [6,14,15], failure prediction and proactive checkpointing [16,17]. There are many challenges in systematically studying large-scale systems using operational data, such as data availability, data collection/mining and fault/failure characterization.…”
Section: Related Work (mentioning)
confidence: 99%
“…Besides memory usage prediction, a lot of other researchers focus on predicting more other job related metrics, such as job runtimes [6,7,10,11], job queue-waiting time [12,13], job start times [20], job completion time [14], power usage [16,17], etc. Instead of using job submission records, some of the work are directly using the recent previous runs of the same applications to calculate the runtimes for next jobs, which limited to certain suitable scenarios.…”
Section: Related Work (mentioning)
confidence: 99%
“…Huge volume of log data are being collected from distributed systems that are widely used in critical application domains. Such large-scale log data have been used for program verification [15][16][17], performance monitoring [18], failure analysis [19,20], and security audits [21,22], as well as for detecting anomalies that occur in these systems [1,4,5]. There are several studies on log parsing, which is a crucial step of log analysis.…”
Section: Related Work (mentioning)
confidence: 99%