2015
DOI: 10.1007/978-3-319-20943-2

Fault-Tolerance Techniques for High-Performance Computing

Abstract: This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via faul…
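For readers unfamiliar with the Young/Daly formula mentioned in the abstract, the following is a minimal sketch of its first-order form, in which the checkpointing period is the square root of twice the checkpoint cost times the platform mean time between failures (MTBF). The function and parameter names are illustrative assumptions and are not taken from the report. The waste model used here is also a simplification that ignores downtime and recovery cost.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period, in seconds.

    Assumes the period is sqrt(2 * C * MTBF), where C is the time to
    write one checkpoint and MTBF is the platform mean time between
    failures. Names are hypothetical, chosen for this sketch.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Simplified fraction of time lost to checkpoints and re-execution.

    Assumption: waste = C / T + T / (2 * MTBF), which ignores downtime
    and recovery time. The Young/Daly period minimizes this expression.
    """
    if period_s <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # illustrative values, not from the report
    period = young_daly_period(cost, mtbf)
    print(f"period = {period:.1f} s, waste = {first_order_waste(period, cost, mtbf):.3f}")
```

With the illustrative values above (a 50 s checkpoint and a 10,000 s MTBF), the sketch gives a period of 1,000 s. Under this simplified model, checkpointing either more or less often than that increases the waste.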

Cited by 70 publications (1 citation statement)
References 55 publications
“…Checkpoint-rollback recovery is a straightforward and popular black-box solution to recover from faults in simulation software [42,43,44]. During run time, the software regularly creates snapshots of the simulation data.…”
Section: Checkpointing and Resilience (mentioning)
confidence: 99%