ACM/IEEE SC 2002 Conference (SC'02) 2002
DOI: 10.1109/sc.2002.10048

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

ISBN: 0-7695-152

Abstract: Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or minutes. We present MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging. MPICH-V architecture reli…

Cited by 149 publications (130 citation statements)
References 12 publications

“…In the context of HPC, many MPI implementations have been retrofitted with or design for FT, ranging from automatic methods (checkpoint-based or log-based) [44], [41], [5] to nonautomated approaches [3], [17].…”
Section: Related Work (mentioning)
confidence: 99%
“…Log-based methods exploit messages logging and optionally their temporal ordering, where the latter is required for asynchronous non-coordinated checkpointing. MPICH-V [5] implements three such protocols. It uses Condor's user-level checkpoint library [29].…”
Section: Related Work (mentioning)
confidence: 99%
“…The MPICH-V1 [7] After a crash, a re-executed process retrieves all lost receptions in the correct order by requesting them to its associated channel memory. The logging has however a major impact on the performance (bandwidth divided by 2) and requires a large number of channel memories.…”
Section: Performances (mentioning)
confidence: 99%
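
The recovery path described in the statement above (a restarted process re-fetching its lost receptions, in order, from its channel memory) can be illustrated with a toy model. This is a minimal sketch, not the MPICH-V implementation: the class names `ChannelMemory` and `Process`, their methods, and the running-sum "application state" are all hypothetical, and it assumes every message transits through a single reliable in-memory log.

```python
from collections import defaultdict


class ChannelMemory:
    """Hypothetical stand-in for a channel memory: logs messages per receiver."""

    def __init__(self):
        # receiver rank -> list of (sender rank, payload), in delivery order
        self._log = defaultdict(list)

    def put(self, sender, receiver, payload):
        self._log[receiver].append((sender, payload))

    def get(self, receiver, index):
        """Return the index-th logged reception of `receiver`, or None."""
        entries = self._log[receiver]
        return entries[index] if index < len(entries) else None


class Process:
    """Toy process whose state is the sum of all payloads it has received."""

    def __init__(self, rank, memory):
        self.rank = rank
        self.memory = memory
        self.received = 0  # number of receptions consumed so far
        self.state = 0     # toy application state

    def send(self, dest, payload):
        # Every message goes through the log rather than directly to the peer.
        self.memory.put(self.rank, dest, payload)

    def recv(self):
        entry = self.memory.get(self.rank, self.received)
        if entry is None:
            raise LookupError("no message logged yet")
        self.received += 1
        self.state += entry[1]
        return entry

    def checkpoint(self):
        # Uncoordinated: taken locally, without synchronizing with other ranks.
        return (self.received, self.state)

    @classmethod
    def restart(cls, rank, memory, ckpt):
        proc = cls(rank, memory)
        proc.received, proc.state = ckpt
        return proc


def demo():
    memory = ChannelMemory()
    sender, receiver = Process(0, memory), Process(1, memory)
    for value in (1, 2, 3):
        sender.send(1, value)
    receiver.recv()
    ckpt = receiver.checkpoint()   # checkpoint after the first reception
    receiver.recv()
    receiver.recv()
    state_before_crash = receiver.state
    # Simulated crash: rebuild rank 1 from its checkpoint and replay the log.
    recovered = Process.restart(1, memory, ckpt)
    while memory.get(1, recovered.received) is not None:
        recovered.recv()
    return state_before_crash, recovered.state
```

In this sketch the restarted process only needs its own checkpoint and the log; the sender is never rolled back. It also makes the cost quoted above visible, since every message is written to the log before the receiver can read it.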
“…For large scale machines like the ASCI-Q machine, the mean time between failures (MTBF) for the whole system is estimated to be mere hours [1]. Thus system stability even in the face of failure of single components is an important goal.…”
Section: Introductionmentioning
confidence: 99%
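
To see why whole-system MTBF falls to hours at this scale, a back-of-the-envelope estimate helps. The sketch below assumes independent nodes with identical, exponentially distributed failure times, so that the system fails as soon as any node fails and the system MTBF is the node MTBF divided by the node count. The function name and the figures in the usage note are illustrative assumptions, not measurements from the cited papers.

```python
def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Estimate system MTBF when the first node failure counts as a system failure.

    Assumes independent, identically distributed exponential node failures,
    in which case the failure rates add and the MTBF divides by `nodes`.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_hours / nodes
```

For example, with a hypothetical per-node MTBF of 5 years (43,800 hours) and 8,192 nodes, `system_mtbf_hours(43800, 8192)` gives roughly 5.3 hours.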