2018
DOI: 10.1002/cpe.4863

EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

Abstract: Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implem…

Cited by 21 publications (18 citation statements)
References 50 publications
“…GPU snapshot was designed to reduce checkpointing cost using asynchronous checkpoint offloading from GPUs to hosts [27]. Chakraborty et al proposed EREINIT to reduce checkpointing overhead for bulk-synchronous MPI applications [9] by implementing fault-tolerance in low-level software layers. Application-level checkpointing approaches save only the main data structures and their metadata for checkpointing [6].…”
Section: Related Work (mentioning)
confidence: 99%
“…2) Reinit: Reinit [13], [14], [19] is an alternative recovery framework designed particularly for global backward nonshrinking recovery. Reinit implements the recovery process into the MPI runtime, thus it is transparent to users.…”
Section: Failure Recovery Interface - ULFM and Reinit (mentioning)
confidence: 99%
“…Reinit provides a simple interface to programmers to define a global restart point, in the form of a resilient target function. The early versions [13], [19], [36], [37] of Reinit have limited usage because they require hard-to-deploy changes to job schedulers. Most recently, Georgakoudis et al [14] propose a new design and implementation of Reinit into the Open MPI runtime.…”
Section: Related Work (mentioning)
confidence: 99%
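The citation statement above describes a global restart point expressed as a resilient target function. As a rough illustration only, the following toy Python sketch simulates that pattern in a single process: a driver re-enters the target function from the top after a failure, and the function rolls back to its last checkpoint. It does not use MPI, and every name in it (`ProcessFailure`, `run_with_global_restart`, `make_resilient_main`) is hypothetical, not part of the Reinit or EReinit interface.

```python
# Toy, single-process simulation of the global-restart pattern.
# All names here are hypothetical and are NOT the Reinit or MPI API.


class ProcessFailure(Exception):
    """Stands in for the runtime detecting a failed process."""


def run_with_global_restart(target, max_restarts=3):
    """Call target(restarted) and re-enter it from the top after each failure."""
    restarted = False
    for _ in range(max_restarts + 1):
        try:
            return target(restarted)
        except ProcessFailure:
            # A real runtime would repair MPI state here; the toy only re-enters.
            restarted = True
    raise RuntimeError("too many failures")


def make_resilient_main(total_steps, fail_at_step):
    """Build a bulk-synchronous style loop that fails once at fail_at_step."""
    checkpoint = {"step": 0, "value": 0}  # stands in for a saved checkpoint
    failed_once = {"done": False}
    log = []  # records each entry into the restart point

    def resilient_main(restarted):
        # On restart, roll back to the last checkpoint instead of step 0.
        if restarted:
            step, value = checkpoint["step"], checkpoint["value"]
        else:
            step, value = 0, 0
        log.append(("enter", restarted, step))
        while step < total_steps:
            if step == fail_at_step and not failed_once["done"]:
                failed_once["done"] = True
                raise ProcessFailure()
            value += step  # one "compute" phase
            step += 1
            if step % 2 == 0:  # checkpoint every second step
                checkpoint.update(step=step, value=value)
        return value

    return resilient_main, log


if __name__ == "__main__":
    main, log = make_resilient_main(total_steps=6, fail_at_step=3)
    print(run_with_global_restart(main), log)
```

In this sketch the failure at step 3 sends control back to the top of `resilient_main`, which resumes from the checkpoint taken at step 2 rather than recomputing from step 0. The application never launches a new job; only the target function is re-entered, which is the idea the quoted statements attribute to the global-restart model.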
“…2) MPI Layer Fault Tolerance: Fault-tolerant MPI mechanisms have been an object of investigation for many years now [15], [19], [34]. Some popular mechanisms for fault tolerance in MPI are ULFM [2], FT-MPI [34] and MPI Reinit [8], [22]. The common goal of these frameworks is to provide a mechanism for the developers to cope with process failures, allowing them to continue the execution without the need to launch a new MPI job.…”
Section: Introduction (mentioning)
confidence: 99%