2020
DOI: 10.1007/978-3-030-50743-5_27

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Abstract: Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, redeploying an application incurs overhead by tearing down and reinstating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implemen…

Citations: cited by 14 publications (17 citation statements)
References: 35 publications
“…MPI_ReInit (Georgakoudis et al, 2020; Laguna et al, 2016) is a similar approach, but it does not expose any low-level APIs like MPI-ULFM. Instead, it focuses on efficient online rollback recovery, simplifying the low-level fault detection and notification mechanism accommodated by MPI-ULFM.…”
Section: Resilience Methodologies (mentioning)
confidence: 99%
“…MPI_ReInit does not specify any data recovery schemes, allowing the use of external software. The latest implementation (Georgakoudis et al, 2020) demonstrates better scalability than MPI-ULFM in the absence of failures. On data recovery for MPI programs, Global View Resilience (GVR, Chien et al, 2015) and VeloC (Nicolae et al, 2019) accommodate generic APIs for data persistency.…”
Section: Fenix and MPI_ReInit (mentioning)
confidence: 99%
“…Note that we choose to evaluate different fault tolerance techniques by triggering a process failure, which does not mean that the MPI recovery frameworks do not support recovery from a node failure. On the one hand, Reinit can recover from a node failure [14]; on the other hand, the current ULFM implementation cannot. In our case, it is sufficient to evaluate on MPI process failures to compare the performance difference when using FTI checkpointing in ULFM and Reinit.…”
Section: Fault Injection (mentioning)
confidence: 99%
“…To implement FTI with Reinit, the only thing to notice is to move the FTI_Init() and FTI_Finalize() functions into the resilient main() function as well. Please read work on Reinit [13], [14], [19] for the design and implementation details of Reinit.…”
Section: B. FTI with Reinit Implementation (mentioning)
confidence: 99%