EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

Chakraborty, Sourav; Laguna, Ignacio; Emani, Murali; Mohror, Kathryn; Panda, Dhabaleswar K.; Schulz, Martin; Subramoni, Hari

doi:10.1002/cpe.4863

Cited by 21 publications

(18 citation statements)

References 50 publications

“…GPU snapshot was designed to reduce checkpointing cost using asynchronous checkpoint offloading from GPUs to hosts [27]. Chakraborty et al proposed EREINIT to reduce checkpointing overhead for bulk-synchronous MPI applications [9] by implementing fault-tolerance in low-level software layers. Application-level checkpointing approaches save only the main data structures and their metadata for checkpointing [6].…”

Section: Related Work (mentioning)

confidence: 99%

Compiler aided checkpointing using crash-consistent data structures in NVMM systems

Coy, Ren, et al. 2020

Proceedings of the 34th ACM International Conference on Supercomputing

Scientific applications use checkpointing for failure recovery. The existing checkpointing approaches were proposed for storing persistent states of applications as checkpoints in disk-based file systems via the block interface. As non-volatile main memory (NVMM) will be included in high-performance computing systems, storing the checkpoints in NVMM-based file systems can significantly waste the performance benefits of NVMM. This is because it underutilizes memory resources and it does not take advantage of the byte-addressability of NVMM. In this paper, we propose an NVMM-aware checkpointing approach, named NV-Checkpoint. It uses a compiler-aided technique to automatically generate multi-version data structures, which consist of both the persistent version of data stored in NVMM for failure recovery and the ephemeral version of data placed across DRAM and NVMM. Because of the byte-addressability of NVMM, any versions can be accessed via the memory interface. The multiple versions may share data that are not mutated during the program's execution to reduce data redundancy. NV-Checkpoint provides the same level of guarantee of failure recovery compared to the conventional checkpointing approaches proposed for file systems. Furthermore, its runtime system manages the layout of the data structures to reduce the number of writes to NVMM. It also manages the checkpointing frequency to reduce persistence overhead using machine learning models. Our experimental results with real-world scientific applications show that the performance of annotated programs with NV-Checkpoint using a hybrid of DRAM and NVMM matches the performance of best-effort handwritten versions. It achieves similar scalability as those with ephemeral data structures using only DRAM. It offers up to 121X speedup of execution time

“…2) Reinit: Reinit [13], [14], [19] is an alternative recovery framework designed particularly for global backward nonshrinking recovery. Reinit implements the recovery process into the MPI runtime, thus it is transparent to users.…”

Section: Failure Recovery Interface - ULFM and Reinit (mentioning)

confidence: 99%

“…Reinit provides a simple interface to programmers to define a global restart point, in the form of a resilient target function. The early versions [13], [19], [36], [37] of Reinit have limited usage because they require hard-to-deploy changes to job schedulers. Most recently, Georgakoudis et al [14] propose a new design and implementation of Reinit into the Open MPI runtime.…”

Section: Related Work (mentioning)

confidence: 99%

MATCH: An MPI Fault Tolerance Benchmark Suite

Guo, Georgakoudis, Parasyris, et al. 2020

2020 IEEE International Symposium on Workload Characterization (IISWC)

Self Cite

MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.

“…2) MPI Layer Fault Tolerance: Fault-tolerant MPI mechanisms have been an object of investigation for many years now [15], [19], [34]. Some popular mechanisms for fault tolerance in MPI are ULFM [2], FT-MPI [34] and MPI Reinit [8], [22]. The common goal of these frameworks is to provide a mechanism for the developers to cope with process failures, allowing them to continue the execution without the need to launch a new MPI job.…”

Section: Introduction (mentioning)

confidence: 99%

Design and Study of Elastic Recovery in HPC Applications

Keller, Parasyris, Bautista-Gomez 2020

2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)

The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow for restarts with a modified number of processes or scientific post-processing of the checkpointed data. This is because they typically use an N-N checkpointing scheme and opaque file-formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.
