2015
DOI: 10.1007/978-3-319-20943-2_3
Fault-Tolerant MPI

Abstract: As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI standard remains distressingly vague on the consequence of failures on MPI communications. In this chapter, we present the spectrum of techniques that can be applied to enable MPI application recovery, ranging from fully automatic to completely user driven. First, we present the effective deployment of most advanced checkpoint/restart techniques within the MPI implementation, so that failed process…

Citations: cited by 6 publications (2 citation statements)
References: 87 publications
“…It seems quite natural to accept redundancies to improve the fault tolerance of a system, e.g. by combining multiple physical storage components to a Redundant Array of Inexpensive Disks (RAID) system (Patterson et al, 1988) or by using techniques that can be applied to enable an automated MPI application recovery (Bouteiller, 2015). There may also be performance improvements associated with it.…”
Section: Accepting Redundant Computations In Parallel Applications (mentioning)
confidence: 99%
“…Early works regarding fault tolerance, including dynamic message passing interface (MPI) programs with checkpointing and resilient versions of MPI, are described in the work of Agbaria and Friedman and by Bosilca et al with a relatively recent summary found in the work of Dongarra et al and of Bouteiller. Cappello et al have summarized recent developments in resiliency that targets exascale.…”
Section: Related Work (mentioning)
confidence: 99%