Proceedings IEEE International Conference on Cluster Computing CLUSTR-03 2003
DOI: 10.1109/clustr.2003.1253321

Coordinated checkpoint versus message log for fault tolerant MPI

citations
Cited by 61 publications
(67 citation statements)
references
References 14 publications
“…The most common fault-tolerance technique used in high performance computing is checkpoint-rollback-recovery [6,7,5,2]. A large body of work has studied periodic coordinated checkpointing for a single divisible application.…”
Section: Related Work (mentioning)
confidence: 99%
“…Too infrequent checkpoints lead to wasteful re-computation when a failure occurs, but too frequent checkpoints lead to overhead during failure-free periods of the application execution. Checkpointing can happen in a coordinated or uncoordinated manner, and the advantages and drawbacks of both approaches are well-documented [5]. Checkpointing can be agnostic to the application, in which case full address space images are saved as checkpoints [6,7].…”
Section: Introduction (mentioning)
confidence: 99%
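
The trade-off in the statement above (rare checkpoints waste re-computation, frequent checkpoints waste failure-free time) can be illustrated with a small sketch. It assumes the first-order approximation usually attributed to Young, in which the lost fraction of time is roughly C/T + T/(2M) for checkpoint cost C, checkpoint interval T, and mean time between failures M. This model, the function names, and the numbers are illustrative assumptions and are not taken from the cited paper or the citing statement.

```python
import math


def waste_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """Approximate fraction of time lost (first-order model, an assumption).

    checkpoint_cost_s / interval_s : overhead paid during failure-free execution
    interval_s / (2 * mtbf_s)      : expected re-computation after a failure
    """
    overhead = checkpoint_cost_s / interval_s
    rework = interval_s / (2.0 * mtbf_s)
    return overhead + rework


def young_interval(checkpoint_cost_s, mtbf_s):
    """Interval that minimizes waste_fraction: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical values: a 60 s checkpoint and one failure per day on average.
checkpoint_cost_s = 60.0
mtbf_s = 24 * 3600.0

best = young_interval(checkpoint_cost_s, mtbf_s)
for label, interval in (("too frequent", best / 10),
                        ("balanced", best),
                        ("too infrequent", best * 10)):
    waste = waste_fraction(interval, checkpoint_cost_s, mtbf_s)
    print(f"{label:>14}: interval={interval:9.1f} s  waste={waste:.3f}")
```

Under these assumptions both extremes lose more time than the balanced interval, which is the qualitative point the quoted statement makes; the model is only meaningful when the interval is much smaller than the mean time between failures.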
“…Moreover the overhead induced during failure-free execution decreases the performances in not very faulty environments, such as clusters [23]. Furthermore, it can lead to the domino effect [24]: a process that rollbacks and that need a message to be replayed, asks another process to rollback. This process does, and asks another one to do so, etc.…”
Section: Number Of Rollbacks (mentioning)
confidence: 99%
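
The domino effect described in the statement above can be sketched as a fixed-point computation. This is a deliberately simplified model, assuming uncoordinated checkpoints, no message logging, and only the replay direction mentioned in the quote: a process that restarts from a checkpoint needs every message it received after that checkpoint to be sent again, so each sender must itself roll back to a checkpoint taken before the send. The function name, data layout, and example values are hypothetical.

```python
import math


def rollback_cascade(checkpoints, messages, failed, failure_time):
    """Return {process: checkpoint time it restarts from} (simplified model).

    checkpoints : dict mapping process -> list of checkpoint times
                  (time 0 stands for the initial state)
    messages    : list of (sender, send_time, receiver, recv_time)
    """
    def latest_checkpoint(proc, t):
        # Most recent checkpoint of `proc` taken at or before time t.
        return max(c for c in checkpoints[proc] if c <= t)

    restart = {failed: latest_checkpoint(failed, failure_time)}
    changed = True
    while changed:  # iterate until no further rollback is triggered
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            # The receiver restarted before this delivery: it needs a replay.
            if receiver in restart and recv_t > restart[receiver]:
                # A sender whose state is past the send must roll back too.
                if restart.get(sender, math.inf) > send_t:
                    restart[sender] = latest_checkpoint(sender, send_t)
                    changed = True
    return restart


# Hypothetical three-process run; A fails at time 7.
checkpoints = {"A": [0, 5], "B": [0, 4], "C": [0, 3]}
messages = [
    ("B", 4.5, "A", 6.0),
    ("C", 3.5, "B", 4.2),
    ("A", 2.0, "C", 3.2),
]
print(rollback_cascade(checkpoints, messages, failed="A", failure_time=7))
```

In this example the single failure of A forces B and C to roll back as well, and A ends up restarting from its initial state rather than from its latest checkpoint. A full treatment would also have to handle orphan messages, which this sketch ignores.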
“…Several implementations of message logging protocols over Message Passing Interface (MPI) library have been developed by Cappello et al [2,3,9]. The well known MPICH-V included the three different protocols (pessimistic, optimistic and causal) into the same library and it was one of the first fault tolerant distributions of MPI.…”
Section: Introduction (mentioning)
confidence: 99%