The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian; Sahay, Vishal; Lumsdaine, Andrew; Duell, Jason; Hargrove, Paul; Roman, Eric

doi:10.1177/1094342005056139

Cited by 198 publications

(173 citation statements)

References 19 publications

(22 reference statements)

Mentioning: 169

“…Overall there is only one other project working on checkpointing in heterogeneous grid environments while there are different projects implementing checkpointing for MPI applications [11]. However, there are many publications proposing sophisticated checkpointing protocols but that are not related to heterogeneity challenges addressed by this paper.…”

Section: Related Work (mentioning)

confidence: 99%

Checkpointing and Migration of Communication Channels in Heterogeneous Grid Environments

Mehnert-Spahn

Schoettner

2010

Algorithms and Architectures for Parallel Processing

Abstract. A grid checkpointing service providing migration and transparent fault tolerance is important for distributed and parallel applications executed in heterogeneous grids. In this paper we address the challenges of checkpointing and migrating communication channels of grid applications executed on nodes equipped with different checkpointer packages. We present a solution that is transparent for the applications and the underlying checkpointers. It also allows using single node checkpointers for distributed applications. The measurement numbers show only a small overhead especially with respect to large grid-applications where checkpointing may consume many minutes.

“…Our work enhances LAM/MPI and BLCR [43], [15], [41], which previously was restricted to reactive FT, to a proactive live migration scheme. LAM/MPI+BLCR originally required a complete system restart for roll-back to the last checkpoint upon failure, but a number of approaches have been designed to allow (a) selected checkpoint images to be restarted on new nodes [6], (b) node and head-node failure [47], and (c) a job-pause mechanism that supports migration without restart [51].…”

Section: Related Work (mentioning)

confidence: 99%

“…In the context of HPC, many MPI implementations have been retrofitted with or designed for FT, ranging from automatic methods (checkpoint-based or log-based) [44], [41], [5] to nonautomated approaches [3], [17].…”

Section: Related Work (mentioning)

confidence: 99%

“…BLCR is an open source, system-level C/R implementation integrated with LAM/MPI via a callback function. The original LAM/MPI+BLCR combination [41] only provides reactive FT and requires a complete job restart from the last checkpoint including job resubmission in case of a node failure. Recent work enhances this capability with a job pause/continue mechanism that keeps an MPI job alive while a failed node is replaced by a spare node [51].…”

Section: Introduction (mentioning)

confidence: 99%

Proactive process-level live migration and back migration in HPC environments

Wang

Mueller

Engelmann

et al. 2012

Journal of Parallel and Distributed Computing

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of processes migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.

“…They use an MPI checkpointing system for the LAM MPI implementation [23] based on Berkeley Lab's Linux Checkpoint/Restart [14], a kernel level checkpointing system. As MPI stores the location of the processes, they have to modify the checkpoints before restarting.…”

Section: Introduction (mentioning)

confidence: 99%

Load Balancing in the Bulk-Synchronous-Parallel Setting using Process Migrations

Bonorden

2007

2007 IEEE International Parallel and Distributed Processing Symposium

The Paderborn University BSP (PUB) library is a powerful C library that supports the development of bulk synchronous parallel programs for various parallel machines. To utilize idle times on workstations for parallel computations, we implement virtual processors using processes. These processes can be migrated to other hosts, when the load of the machines changes. In this paper we describe the implementation for a Linux workstation cluster. We focus on process migration and show first benchmarking results.
