2016 IEEE 23rd International Conference on High Performance Computing (HiPC) 2016
DOI: 10.1109/hipc.2016.035

Using Message Logs and Resource Use Data for Cluster Failure Diagnosis

Abstract: Copyright and reuse: The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available. Copies of full items can be used for personal research or …

Cited by 9 publications (17 citation statements)
References 21 publications
“…In future, we plan to integrate the information from system messages into the analysis, rather than using that for postvalidation. Other researchers have reported promising results by analyzing the two sources of information together [10], [2]. While the proposed error statistic allows for a visual tracking of the system performance, we plan to build a statistical anomaly detection model that leverages the expected behavior of the statistic to identify appropriate thresholds for flagging an anomalous event.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…The four anomalies that lie above the threshold are verified using the system log data. relied on message logs [23], [16], [8], [20], [19] or resource usage data [9], [1], or both [3], [10], [2]. Since this paper focuses on the detection task, we present a brief overview of related methods that deal with detecting faults.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Recent work which use resource use data and message logs for failure diagnosis [21], [22] and error detection [23], [24] has shown increased accuracy over using message logs alone. [21] provides partial diagnosis of system failures by using resource use data to identify resource anomalies, and provides a more precise diagnosis by using message log-analysis.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…[24] combines analyses of message logs and resource use data but the focus is on error detection. [23] uses message logs and resource use data to increase the error handling time window, and [22] is focused on correlating resource usage and message logs with system failures. [25] combines analysis of RAS logs and job logs but the focus is on identifying failure characteristics in a cluster system.…”
Section: Introduction
Citation type: mentioning
confidence: 99%