2006
DOI: 10.1177/1094342006067469

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

Abstract: High-performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing, and comparing several automatic fault-tolerance protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols.…

Cited by 107 publications (73 citation statements)
References 29 publications (72 reference statements)
“…Typical values for ρ lie in the interval [1; 2], meaning that re-execution time can be reduced by up to half for some applications [16]. Fortunately, the introduction of λ and ρ is not difficult to account for in the expression of the expected waste: in Equation (25), we replace WORK by λ WORK and RE-EXEC by RE-EXEC/ρ and obtain…”
Section: Work Time (mentioning)
confidence: 99%
“…Often global snapshots are used for fault recovery, but, as this paper demonstrates, can also be used to support reverse execution while debugging the MPI application. For an analysis of the performance implications of integrating C/R into an MPI implementation we refer the reader to previous literature on the subject [9,11].…”
Section: Related Work (mentioning)
confidence: 99%
“…The state of the communication channels is usually captured by a C/R-enabled MPI implementation, such as Open MPI [6]. Although C/R is not part of the MPI standard, it is often provided as a transparent service by MPI implementations [6][7][8][9]. The state of the process is captured by a Checkpoint/Restart Service (CRS), such as Berkeley Lab Checkpoint/Restart (BLCR) [10].…”
Section: Related Work (mentioning)
confidence: 99%
“…More recently, MPICH-V [3] system support a wide range of fault tolerance protocols. Its generic framework covers coordinated, uncoordinated, pessimistic logging and causal logging, etc.…”
Section: Related Work (mentioning)
confidence: 99%
“…As the systems scale, these systems have become increasingly fragile; an ideal MPI implementation thus must embody fault tolerance properties that support long execution times in fragile underlying execution environments, while being easy and flexible to use as well as being portable and adaptable of their execution environment and fault characteristics. Previous fault tolerant MPI implementations such as LAM/MPI [16] and MPICH-V [3] are easy to use for the user in that fault tolerance is largely transparent to the programmer, but their recovery protocol is fixed and thus not adaptable. This may not be desirable, as for example, the desirable recovery method is obviously different when a fault is transient software one confined in a single process, versus a repetitive one caused by a faulty hardware.…”
Section: Introduction (mentioning)
confidence: 99%