Assuming Failure Independence: Are We Right to be Wrong?

Aupy, Guillaume; Robert, Yves; Vivien, Frédéric

doi:10.1109/cluster.2017.24

Cited by 7 publications

(2 citation statements)

References 23 publications


“…[62] introduces a dynamic strategy called lazy checkpointing to adjust to changes in the failure rate. Another approach has been proposed in [4], using quantiles of consecutive IAT pairs. It is an open problem to derive an efficient checkpoint strategy that can account for temporal or spatial dependence between failures.…”

Section: Discussion (mentioning)

confidence: 99%

Checkpointing à la Young/Daly: An Overview

Benoît, Hérault, et al. 2022

Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing




“…We use the two traces featuring the largest number of failures from the LANL archive [26,27], namely LANL#2 and LANL#18. According to the detailed study in [2], failures in LANL#18 are not correlated while those in LANL#2 are correlated, providing perfect candidates to experimentally study the impact of failure distributions. LANL#2 has an MTBF of 14.1 hours and is composed of 5350 failures, while LANL#18 has an MTBF of 7.5 hours and is composed of 3899 failures.…”

Section: Model Accuracy (mentioning)

confidence: 99%

Replication is more efficient than you think

Benoît, Hérault, Fèvre, et al. 2019

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite


This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $T^{\text{no}}_{\text{MTTI}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T^{\text{rs}}_{\text{opt}}$ for this strategy, which is much larger than $T^{\text{no}}_{\text{MTTI}}$, thereby decreasing I/O pressure. We show through simulations that using $T^{\text{rs}}_{\text{opt}}$ and the restart strategy, instead of $T^{\text{no}}_{\text{MTTI}}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication, in terms of both total execution time and energy consumption.
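The Young/Daly period quoted in step (ii) of the abstract can be illustrated with a minimal sketch. It only evaluates the first-order formula sqrt(2MC) for a given MTTI M and checkpoint duration C; it does not derive M from the number of processor pairs, and it does not compute the restart-strategy period, since neither formula is given here. The function name and the numeric values are hypothetical, chosen for illustration only.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return the first-order Young/Daly checkpointing period sqrt(2 * M * C).

    mtti            -- application Mean Time To Interruption M (any time unit)
    checkpoint_cost -- checkpoint duration C (same time unit as mtti)
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


# Hypothetical values (not taken from the paper): M = 10 hours, C = 60 seconds.
mtti_seconds = 10 * 3600.0
checkpoint_seconds = 60.0
period = young_daly_period(mtti_seconds, checkpoint_seconds)
print(f"Checkpoint every {period:.0f} s (about {period / 60:.1f} min)")
```

Because the period grows with the square root of M, quadrupling the MTTI only doubles the interval between checkpoints.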
