2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016
DOI: 10.1109/ipdps.2016.39

Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors

Abstract: This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially o…

Cited by 21 publications (25 citation statements)
References 21 publications
“…which is consistent with the results obtained in [2,6,7], provided that a reliable silent error detector is available. However, as mentioned previously, such a detector is only known in some application-specific domains.…”
Section: General Process Replication (supporting)
confidence: 91%
“…Di et al [12] analyzed a two-level computational pattern, and proved that equal-length checkpointing segments constitute the optimal solution. Benoit et al [3] relied on disk checkpoints to cope with fail-stop failures and used memory checkpoints coupled with error detectors to handle silent data corruptions. They derived first-order approximation formulas for the optimal pattern length as well as the number of memory checkpoints between two disk checkpoints.…”
Section: Checkpointing (mentioning)
confidence: 99%
“…Checkpointing with rollback recovery [17,23] is the de-facto general-purpose recovery technique in high-performance computing. Finding the optimal checkpointing interval [7,19,21,49] or the optimal recovery method for SPH codes is beyond the scope of this paper.…”
Section: Error Correction (mentioning)
confidence: 99%