Proceedings of the ACM/IEEE SC2004 Conference
DOI: 10.1109/sc.2004.29

Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

Abstract: The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. Therefore, to run to completion, these applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CP…
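The checkpoint-and-restart cycle described in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's scheme, which targets MPI programs and has to deal with multiple communicating processes; this example only shows the basic idea of an application saving its own state periodically and resuming from the most recent saved state. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical and chosen for illustration, and a local file stands in for stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it.

    The rename ensures that a crash during the write never leaves a
    half-written checkpoint at `path`.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a hardware failure
    just before the given step is executed.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run with `interval=3` fails at step 7, the work done in step 6 is lost, and a second call to `run` with the same path resumes from the checkpoint taken after step 5. The checkpoint interval therefore trades the cost of writing state against the amount of work repeated after a failure.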

Citations: cited by 57 publications (38 citation statements)
References: 15 publications
“…Bronevetsky et al provide a source to source compiler tool that can automatically instruments the code to save and restore its own status. The tool coordinates checkpoints and restarts for parallel OpenMP [18], [19] and MPI programs [20]-[22].…”
Section: Related Work (mentioning)
confidence: 99%
“…Representative works include failure-aware resource management and scheduling [10,15,20], checkpointing [6,18,24,38], proactive or adaptive runtime resilience support [14,29]. The advance of these technologies, however, greatly depends on whether we can predict the occurrence of failure, i.e., failure prediction.…”
Section: Motivations (mentioning)
confidence: 99%
“…System-level checkpoints at remote storage cause large amounts of data to be sent through the network, but application-level checkpoints require modifications of the application code, and as such are not completely transparent to the programmer, in the sense that a code written for a non-fault-tolerant implementation of MPI requires some modifications to be executed on a fault-tolerant implementation of MPI using application-level checkpoints [Schulz et al 2004] [Bronevetsky et al 2003]. …”
Section: Related Work (mentioning)
confidence: 99%