Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols

Buntinas, Darius; Coti, Camille; Hérault, Thomas; Lemarinier, Pierre; Pilard, Laurence; Rezmerita, Ala; Rodríguez, Eric; Cappello, Franck

doi:10.1016/j.future.2007.02.002

Cited by 57 publications

(33 citation statements)

References 6 publications


“…One promising alternative is to use local storage (memory, SSD, local disks) [1], [2], [4]. During checkpoint, the application usually stops the execution until the checkpoint is safely stored, using what is called the blocking algorithm [5].…”

Section: Motivation (mentioning)

confidence: 99%


Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm

Meneses and Kalé, 2012. 2012 IEEE International Conference on Cluster Computing.


Abstract: The HPC community has seen a steady increase in the number of components in every generation of supercomputers. Assembling a large number of components into a single cluster makes a machine more powerful, but also much more prone to failures. Therefore, fault tolerance has become a major concern in HPC. To deal with node crashes in large systems, checkpoint/restart is by far the preferred method. A typical way to implement checkpoints is by using a blocking algorithm, which suspends the execution of the application while the checkpoint is safely stored. One limitation of the blocking algorithm is that it saturates the network bandwidth at the time of checkpoint. This problem will become even more critical because the projected network bandwidth increase will not match the increase in memory per node. To alleviate this problem, we have developed a semi-blocking checkpoint algorithm that overlaps execution of the application with transmission of checkpoints. Our implementation decomposes a checkpoint into small messages that are interleaved with application messages. The experimental results show a dramatic reduction in the checkpoint overhead for various applications. We present a model for our approach and use this model to compute the benefit of the semi-blocking algorithm for different failure rates predicted at Exascale. We estimate that our method can reduce the total execution time of an iterative scientific application by up to 22%.
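The interleaving idea in the abstract above (a checkpoint split into small messages that share the wire with application messages) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `split_checkpoint`, `interleave`, and `reassemble`, the strict one-to-one alternation policy, and the in-memory list standing in for the network are all assumptions made for illustration.

```python
from itertools import zip_longest


def split_checkpoint(state: bytes, chunk_size: int) -> list[bytes]:
    """Cut the serialized checkpoint into fixed-size chunks (hypothetical helper)."""
    return [state[i:i + chunk_size] for i in range(0, len(state), chunk_size)]


def interleave(app_msgs: list[bytes], ckpt_chunks: list[bytes]) -> list[tuple[str, bytes]]:
    """Alternate one application message with one checkpoint chunk.

    Assumed policy: strict alternation; whichever stream is longer
    simply continues once the other one is exhausted.
    """
    wire = []
    for app, chunk in zip_longest(app_msgs, ckpt_chunks):
        if app is not None:
            wire.append(("app", app))
        if chunk is not None:
            wire.append(("ckpt", chunk))
    return wire


def reassemble(wire: list[tuple[str, bytes]]) -> bytes:
    """Rebuild the checkpoint on the receiving side from the tagged chunks."""
    return b"".join(payload for kind, payload in wire if kind == "ckpt")


state = bytes(range(10))                 # stand-in for process state
chunks = split_checkpoint(state, 4)      # three chunks: 4 + 4 + 2 bytes
wire = interleave([b"a", b"b"], chunks)  # application traffic keeps flowing
restored = reassemble(wire)
```

In this sketch the application is never held back until the whole checkpoint has been sent, which is the property the abstract contrasts with the blocking algorithm.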


Section: Motivation (mentioning)

confidence: 99%

“…This non-blocking algorithm is totally asynchronous and runs in conjunction with the application. However, since it needs to store the in-flight messages as part of the checkpoint, it has a higher memory footprint and a non-trivial implementation [5].…”

Section: Introduction (mentioning)

confidence: 99%

Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm

Meneses and Kalé, 2012. 2012 IEEE International Conference on Cluster Computing.


“…Several algorithms have been proposed to coordinate checkpoints, the most usual being the Chandy-Lamport algorithm [6] and the blocking coordinated checkpointing [5,17], which silences the network. In these algorithms, waves of tokens are exchanged to form a recovery line that eliminates orphan messages and detects in-transit messages.…”

Section: Building a Consistent Recovery Set (mentioning)

confidence: 99%

Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Bouteiller, Hérault, Bosilca, et al., 2011. Euro-Par 2011 Parallel Processing.

Self Cite


Abstract. Based on our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.


“…Among rollback-recovery techniques [7], sender-based message logging [1,8,20] with checkpointing [2,3,6,11,14] is one of the most lightweight fault-tolerance techniques capable of being applied in those fields. It may considerably lower the high failure-free overhead of receiver-based message logging [15,21], which results from synchronously logging each message into stable storage; this is achieved by using the volatile memory of the sender as storage for logging [1,7,8,10,20].…”

Section: Introduction (mentioning)

confidence: 99%

Virtual Sender-based Message Logging for Large-scale Ubiquitous Sensor Network Systems

Ahn, 2014. IJMUE.

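The sender-based message logging idea quoted above (payloads kept in the sender's volatile memory and replayed after the receiver rolls back to its checkpoint) can be sketched as follows. This is a deliberately simplified illustration, not the protocol from any of the cited papers: the `Sender` and `Receiver` classes are hypothetical, there is exactly one sender and one receiver, delivery is a direct method call, and the receiver's delivered-message count is assumed to double as the next expected sequence number.

```python
class Receiver:
    """Toy receiver with an in-memory checkpoint (hypothetical class)."""

    def __init__(self):
        self.delivered = []   # messages delivered so far
        self._saved = []      # state captured at the last checkpoint

    def deliver(self, seq, payload):
        # seq is unused here; a real protocol would use it to order
        # and de-duplicate replayed messages.
        self.delivered.append(payload)

    def checkpoint(self):
        self._saved = list(self.delivered)

    def crash_and_restore(self):
        """Lose everything after the checkpoint; return the next expected seq.

        Assumption: with a single sender, the number of delivered
        messages equals the next sequence number.
        """
        self.delivered = list(self._saved)
        return len(self.delivered)


class Sender:
    """Toy sender that logs every payload in its own volatile memory."""

    def __init__(self):
        self.log = []         # (seq, payload) pairs, never written to disk
        self.next_seq = 0

    def send(self, receiver, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, payload))   # log first, then transmit
        receiver.deliver(seq, payload)

    def replay(self, receiver, from_seq):
        """Resend every logged message the restored receiver has not seen."""
        for seq, payload in self.log:
            if seq >= from_seq:
                receiver.deliver(seq, payload)


sender, receiver = Sender(), Receiver()
sender.send(receiver, "m0")
sender.send(receiver, "m1")
receiver.checkpoint()
sender.send(receiver, "m2")
sender.send(receiver, "m3")

resume_from = receiver.crash_and_restore()   # receiver rolls back to m0, m1
after_crash = list(receiver.delivered)
sender.replay(receiver, resume_from)         # sender refills m2, m3 from its log
```

No message touches stable storage on the failure-free path in this sketch, which is the overhead saving the quoted passage attributes to sender-based logging relative to receiver-based logging.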
