Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn; Supinski, Bronis R. de

doi:10.1109/sc.2010.18

Cited by 398 publications

(376 citation statements)

References 23 publications

Supporting

Mentioning

373

Contrasting

Unclassified

“…Latent errors, also known as silent errors or silent data corruption, represent a major threat to scientific applications executing on large scale platforms [21,22,23]. There are several causes of silent errors, such as cosmic radiation, packaging pollution, among others.…”

Section: Related Work (mentioning)

confidence: 99%

Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis

Cavelan, Fang, Chien, et al. 2017

Lecture Notes in Computer Science

This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that optimally choosing its value leads to significant gains over the CR approach.

“…Moody et al introduced multi-level checkpointing to improve scalability [29]. Traditional checkpoint systems use the parallel file system (PFS) to store the checkpoint data.…”

Section: Related Work (mentioning)

confidence: 99%

Deduplication Potential of HPC Applications’ Checkpoints

Kaiser, Gad, Süß, et al. 2016

2016 IEEE International Conference on Cluster Computing (CLUSTER)

“…On the other hand, application-level checkpoint assumes that the state of the tasks is enough to resume the execution of the program in case of a failure. The SCR library [3] uses this approach. One advantage of application-level checkpoint is to dramatically reduce the amount of memory to be checkpointed.…”

Section: A Checkpoint/restart (mentioning)

confidence: 99%

“…This paper compares three standard checkpoint-based fault tolerance methods according to their energy consumption. The first method is the traditional checkpoint/restart based on local storage that has been implemented in several libraries [3], [4]. The second strategy is a particular version of message-logging [5] that requires messages to be stored, but avoids a global rollback in case of a failure.…”

Section: Introduction (mentioning)

confidence: 99%

Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems

Meneses, Sarood, Kalé 2012

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

Abstract: An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.
