2020
DOI: 10.1016/j.future.2020.01.026

Fault tolerance of MPI applications in exascale systems: The ULFM solution

Abstract: The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not p…

Cited by 38 publications (7 citation statements)
References 39 publications (50 reference statements)

“…Although not yet part of the standard, the proposed ULFM extensions have already been applied in a number of studies (see, e.g. Ali et al, 2014; Ashraf et al, 2018; Bland et al, 2013; Cantwell and Nielsen, 2019; Engwer et al, 2018; Fagg and Dongarra, 2000; Gamell et al, 2017a, 2017b; Losada et al, 2020; Teranishi and Heroux, 2014).…”
Section: Resilience Methodologies (mentioning)
confidence: 99%
“…User-level failure mitigation (ULFM) extensions to the MPI standard provide an application-driven mechanism to detect and recover communication channels in the event of a hard fault leading to the loss of one or more processes. These extensions are currently under consideration for inclusion in the MPI 4.x standard (Losada et al, 2020). In prior versions of MPI, communication routines would either trigger an immediate abortion of the program or return control to the application to support a more controlled termination.…”
Section: User-level Failure Mitigation and Advanced Checkpointing (mentioning)
confidence: 99%
“…However, MPI was found efficient only when the size of the data set is small or moderate [17]. Moreover, MPI does not provide any fault-tolerant constructs for users to handle failures [18]. Thus, the recovery procedure is activated only when the application is aborted.…”
Section: Related Work (mentioning)
confidence: 99%
“…Several programming models include now resilience support. In Reference [ 82 ], authors provide a detailed analysis of the resilience features of the different programing languages, grouped by paradigm: message passing (e.g., MPI-ULFM [ 118 ]), partitioned global address space (e.g., UPC++ [ 12 ]), asynchronous partitioned global address space (e.g., X10 [ 32 ]), actor (e.g., Erlang [ 174 ]), dataflow (e.g., Legion [ 13 ]). Table 2 summarizes the comparison developed in Reference [ 82 ].…”
Section: Programming Models and Runtime Managers (mentioning)
confidence: 99%