Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

Schulz, Martin; Bronevetsky, Greg; Fernandes, Rohit; Marques, Daniel; Pingali, Keshav; Stodghill, Paul

doi:10.1109/sc.2004.29

Cited by 57 publications

(38 citation statements)

References 15 publications


“…Bronevetsky et al. provide a source-to-source compiler tool that automatically instruments the code to save and restore its own status. The tool coordinates checkpoints and restarts for parallel OpenMP [18], [19] and MPI programs [20]-[22].…”

Section: Related Work (mentioning)

confidence: 99%

Deduplication Potential of HPC Applications’ Checkpoints

Kaiser, Gad, Süß et al. 2016

2016 IEEE International Conference on Cluster Computing (CLUSTER)


“…Representative works include failure-aware resource management and scheduling [10,15,20], checkpointing [6,18,24,38], proactive or adaptive runtime resilience support [14,29]. The advance of these technologies, however, greatly depends on whether we can predict the occurrence of failure, i.e., failure prediction.…”

Section: Motivations (mentioning)

confidence: 99%

A study of dynamic meta-learning for failure prediction in large-scale systems

Lan, Zheng et al. 2010

Journal of Parallel and Distributed Computing

Despite years of study on failure prediction, it remains an open problem, especially in large-scale systems composed of a vast number of components. In this paper, we present a dynamic meta-learning framework for failure prediction. It is intended not only to provide reasonable prediction accuracy, but also to be of practical use in realistic environments. Two key techniques are developed to address the technical challenges of failure prediction. One is meta-learning, which boosts prediction accuracy by combining the benefits of multiple predictive techniques. The other is a dynamic approach that obtains failure patterns from a changing training set and extracts effective rules by actively monitoring prediction accuracy at runtime. We demonstrate the effectiveness and practical use of this framework by means of real system logs collected from the production Blue Gene/L systems at Argonne National Laboratory and San Diego Supercomputer Center. Our case studies indicate that the proposed mechanism can provide reasonable prediction accuracy by forecasting up to 82% of the failures, with a runtime overhead of less than 1.0 minute.

“…System-level checkpoints at remote storage cause large amounts of data to be sent through the network, but application-level checkpoints require modifications of the application code, and as such are not completely transparent to the programmer, in the sense that a code written for a non-fault-tolerant implementation of MPI requires some modifications to be executed on a fault-tolerant implementation of MPI using application-level checkpoints [Schulz et al. 2004] [Bronevetsky et al. 2003]. …”

Section: Related Work (mentioning)

confidence: 99%

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

Coti, Herault, Lemarinier et al. 2006

ACM/IEEE SC 2006 Conference (SC'06)

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.
