MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

Bosilca, George; Bouteiller, Aurélien; Cappello, Franck; Djilali, Samir; Fedak, Gilles; Germain, Cécile; Hérault, Thomas; Lemarinier, Pierre; Lodygensky, Oleg; Magniette, Frédéric; Néri, Vincent; Selikhov, Anton

doi:10.1109/sc.2002.10048

Cited by 150 publications

(131 citation statements)

References 12 publications

Mentioning: 128

“…In the context of HPC, many MPI implementations have been retrofitted with or design for FT, ranging from automatic methods (checkpoint-based or log-based) [44], [41], [5] to nonautomated approaches [3], [17].…”

Section: Related Work (mentioning)

confidence: 99%


Proactive process-level live migration and back migration in HPC environments

Wang, Mueller, Engelmann, et al. 2012

Journal of Parallel and Distributed Computing


As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
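A back-of-the-envelope check of the final claim, assuming Young's first-order approximation for the optimal checkpoint interval. This model is an assumption made here for illustration; the abstract does not say which model the authors used. The symbols C (cost of one checkpoint), M (mean time between failures), T (job run time), p (fraction of faults handled proactively), and N (number of checkpoints) are introduced only for this sketch.

```latex
\begin{align*}
\tau_{\text{opt}} &\approx \sqrt{2 C M}
  && \text{(Young's approximation)} \\
M' &= \frac{M}{1 - p}
  && \text{(only a fraction } 1 - p \text{ of faults still forces a restart)} \\
\frac{N'}{N} &= \frac{T / \tau'_{\text{opt}}}{T / \tau_{\text{opt}}}
  = \sqrt{\frac{M}{M'}} = \sqrt{1 - p} \\
p = 0.7 &\;\Rightarrow\; \frac{N'}{N} = \sqrt{0.3} \approx 0.55
\end{align*}
```

Under these assumptions, handling 70% of faults proactively leaves roughly 55% of the original checkpoints, which is consistent with "nearly cutting the number of checkpoints in half".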


“…Log-based methods exploit messages logging and optionally their temporal ordering, where the latter is required for asynchronous non-coordinated checkpointing. MPICH-V [5] implements three such protocols. It uses Condor's user-level checkpoint library [29].…”

Section: Related Work (mentioning)

confidence: 99%

Proactive process-level live migration and back migration in HPC environments

Wang, Mueller, Engelmann, et al. 2012

Journal of Parallel and Distributed Computing


“…The MPICH-V1 [7] … After a crash, a re-executed process retrieves all lost receptions in the correct order by requesting them to its associated channel memory. The logging has however a major impact on the performance (bandwidth divided by 2) and requires a large number of channel memories.…”

Section: Performances (mentioning)

confidence: 99%

P2P-MPI: A Peer-to-Peer Framework for Robust Execution of Message Passing Parallel Programs on Grids

Genaud, Rattanapoka 2006

J Grid Computing
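The recovery behaviour described in the citation statement above can be illustrated with a toy model. This is a minimal sketch, not MPICH-V code: the `ChannelMemory` and `Process` classes, their methods, and `demo` are hypothetical names invented here, and the model assumes a single reliable channel memory and a restart from scratch rather than from a checkpoint. It shows only the core idea that a restarted process re-reads its receptions from the log in the original order.

```python
from collections import defaultdict


class ChannelMemory:
    """Hypothetical stand-in for a channel memory: a reliable store that
    logs, in arrival order, every message addressed to each receiver."""

    def __init__(self):
        # receiver rank -> list of (sender rank, payload), in arrival order
        self._log = defaultdict(list)

    def put(self, sender, receiver, payload):
        # A send is modelled as an append to the receiver's log.
        self._log[receiver].append((sender, payload))

    def get(self, receiver, index):
        # A receive reads the index-th logged message, or None if absent.
        log = self._log[receiver]
        return log[index] if index < len(log) else None


class Process:
    """Toy MPI-like process whose only state is what it has received."""

    def __init__(self, rank, channel_memory):
        self.rank = rank
        self.cm = channel_memory
        self.next_index = 0  # number of receptions consumed so far
        self.state = []      # received (sender, payload) pairs

    def send(self, dest, payload):
        self.cm.put(self.rank, dest, payload)

    def recv(self):
        msg = self.cm.get(self.rank, self.next_index)
        if msg is None:
            raise RuntimeError("no message logged yet")
        self.next_index += 1
        self.state.append(msg)
        return msg

    def restart(self):
        # Simulate a crash: volatile state is lost, the log survives.
        self.next_index = 0
        self.state = []


def demo():
    cm = ChannelMemory()
    p0, p1, p2 = (Process(rank, cm) for rank in range(3))
    p0.send(2, "a")
    p1.send(2, "b")
    p0.send(2, "c")
    first_run = [p2.recv() for _ in range(3)]
    p2.restart()
    replay = [p2.recv() for _ in range(3)]
    return first_run, replay
```

In this model every message is written to the channel memory and then read back by the receiver, so it is transferred twice. That is one plausible reading of the "bandwidth divided by 2" remark in the quotation, offered here as an interpretation rather than a measured result.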


“…For large scale machines like the ASCI-Q machine, the mean time between failures (MTBF) for the whole system is estimated to be mere hours [1]. Thus system stability even in the face of failure of single components is an important goal.…”

Section: Introduction (mentioning)

confidence: 99%

The Self Distributing Virtual Machine (SDVM): Making Computer Clusters Adaptive

Haase, Hofmann, Waldschmidt

IFIP International Federation for Information Processing


Abstract. The Self Distributing Virtual Machine (SDVM) is a middleware concept to form a parallel computing machine consisting of any set of processing units, such as functional units in a processor or FPGA, processing units in a multiprocessor chip, or computers in a computer cluster. Its structure and functionality are biologically inspired, aiming towards forming a combined workforce of independent units ("sites"), each acting on the same set of simple rules. The SDVM supports growing and shrinking the cluster at runtime as well as heterogeneous clusters. It uses the work-stealing principle to dynamically distribute the workload among all sites. The SDVM's energy management targets the health of all sites by adjusting their power states according to workload and temperature. Dynamic reassignment of the current workload facilitates a new energy policy which focuses on increasing the reliability of each site. This paper presents the structure and the functionality of the SDVM.
