2005
DOI: 10.1177/1094342005056139

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Abstract: As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. …

Citations: cited by 198 publications (173 citation statements)
References: 19 publications (22 reference statements)
“…Overall there is only one other project working on checkpointing in heterogeneous grid environments while there are different projects implementing checkpointing for MPI applications [11]. However, there are many publications proposing sophisticated checkpointing protocols but that are not related to heterogeneity challenges addressed by this paper.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our work enhances LAM/MPI and BLCR [43], [15], [41], which previously was restricted to reactive FT, to a proactive live migration scheme. LAM/MPI+BLCR originally required a complete system restart for roll-back to the last checkpoint upon failure, but a number of approaches have been designed to allow (a) selected checkpoint images to be restarted on new nodes [6], (b) node and head-node failure [47], and (c) a job-pause mechanism that supports migration without restart [51].…”
Section: Related Work (mentioning)
confidence: 99%
“…In the context of HPC, many MPI implementations have been retrofitted with or design for FT, ranging from automatic methods (checkpoint-based or log-based) [44], [41], [5] to nonautomated approaches [3], [17].…”
Section: Related Work (mentioning)
confidence: 99%
“…They use an MPI checkpointing system for the LAM MPI implementation [23] based on Berkeley Lab's Linux Checkpoint/Restart [14], a kernel level checkpointing system. As MPI stores the location of the processes, they have to modify the checkpoints before restarting.…”
Section: Introduction (mentioning)
confidence: 99%