2017 IEEE International Conference on Cluster Computing (CLUSTER) 2017
DOI: 10.1109/cluster.2017.24

Assuming Failure Independence: Are We Right to be Wrong?

Abstract: This report revisits the failure temporal independence hypothesis, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, a…

Cited by 7 publications (2 citation statements)
References 23 publications

“…[62] introduces a dynamic strategy called lazy checkpointing to adjust to changes in the failure rate. Another approach has been proposed in [4], using quantiles of consecutive IAT pairs. It is an open problem to derive an efficient checkpoint strategy that can account for temporal or spatial dependence between failures.…”
Section: Discussion (mentioning)
confidence: 99%
“…We use the two traces featuring the largest number of failures from the LANL archive [26,27], namely LANL#2 and LANL#18. According to the detailed study in [2], failures in LANL#18 are not correlated while those in LANL#2 are correlated, providing perfect candidates to experimentally study the impact of failure distributions. LANL#2 has an MTBF of 14.1 hours and is composed of 5350 failures, while LANL#18 has an MTBF of 7.5 hours and is composed of 3899 failures.…”
Section: Model Accuracy (mentioning)
confidence: 99%