2012 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing 2012
DOI: 10.1109/pdp.2012.22

File I/O for MPI Applications in Redundant Execution Scenarios

Abstract: As multi-petascale and exa-scale high-performance computing (HPC) systems inevitably have to deal with a number of resilience challenges, such as a significant growth in component count and smaller circuit sizes with lower circuit voltages, redundancy may offer an acceptable level of resilience that traditional fault tolerance techniques, such as checkpoint/restart, do not. Although redundancy in HPC is quite controversial due to the associated cost for redundant components, the constantly increasing number of …

Citations: cited by 7 publications (5 citation statements)
References: 11 publications (10 reference statements)
“…If an algorithm lacks a simple checking method or invariant, the Checker can be provided through comparison with a checksum over the data that was computed beforehand and stored in a safe region. The Recover method can be supplied through the forward recovery phase in ABFT methods, or simply by restoring a light-weight deduplicated [1] or compressed [17] checkpoint of the data.…”
Section: Assumptions (mentioning)
confidence: 99%
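
The Checker/Recover pairing described in this statement can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the citing authors' implementation: the function names `checksum`, `checker`, `make_checkpoint`, and `recover` are hypothetical, SHA-256 stands in for whichever checksum is actually used, zlib compression stands in for the compressed checkpoint, and the "safe region" is modeled as ordinary variables.

```python
import hashlib
import zlib


def checksum(data: bytes) -> str:
    """Compute a checksum over the data (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def make_checkpoint(data: bytes) -> bytes:
    """Create a light-weight compressed checkpoint of the data."""
    return zlib.compress(data)


def checker(data: bytes, expected: str) -> bool:
    """Checker: compare the current data against the stored checksum."""
    return checksum(data) == expected


def recover(checkpoint: bytes) -> bytes:
    """Recover: restore the data from the compressed checkpoint."""
    return zlib.decompress(checkpoint)


# Computed beforehand and kept in a (hypothetical) safe region.
original = bytes(range(256)) * 4
safe_checksum = checksum(original)
safe_checkpoint = make_checkpoint(original)

# Simulate silent data corruption with a single bit flip.
corrupted = bytearray(original)
corrupted[10] ^= 0x01
working = bytes(corrupted)

# Detect the corruption, then restore the checkpointed state.
if not checker(working, safe_checksum):
    working = recover(safe_checkpoint)
```

After the final step, `working` holds the same bytes as `original`, so the protected computation could be retried from known-good data.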
“…CR and redundant computing already ensure idempotency since identical state is restored in the former while redundant state exists for the latter, but ABFT methods have to be complemented, e.g., by compiler-driven live variable analysis to capture/restore globals at region boundaries. Existing solutions to I/O idempotency are required as well [2]. We can then retry a computation if needed, i.e., when no other recovery methods exist (or if the other recovery methods have failed).…”
Section: Assumptions (mentioning)
confidence: 99%
“…As the primary focus of this work is to investigate redundancy as a means to protect application data, RedMPI only attempts to protect against corruption in data and not application code or instructions. RedMPI does not protect MPI I/O functionality, but orthogonal work [22]…”
Section: B. Assumptions (mentioning)
confidence: 99%
“…Note that checkpointing, recovery are not implemented in the current prototype. Since I/O operations are often used to save intermediate results and implement application-level checkpointing, we plan to integrate application level checkpointing using the solution proposed in [1] to handle IO in a replicated MPI application.…”
Section: SDR-MPI Is Integrated Into Open MPI (mentioning)
confidence: 99%
“…The study presented in [9] is the first showing that active replication could outperform coordinated checkpointing at scale. In that work, replication is combined with coordinated checkpointing: If each process is replicated, the probability that the application needs to be restarted from a checkpoint, meaning that all replicas of one process have failed, is dramatically reduced compared to a scenario without replication.…”
Section: Introduction (mentioning)
confidence: 99%
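
The reduction in restart probability mentioned in this statement can be illustrated with a simple worked model. The model is an assumption made here for illustration and is not taken from the cited study: each physical process fails independently with probability $p$ during the interval of interest, the application has $n$ logical processes, and each logical process has $r$ replicas.

```latex
% Probability that one logical process loses all r of its replicas:
P_{\text{lost}} = p^{r}

% Probability that at least one logical process is lost,
% so the application must restart from a checkpoint:
P_{\text{restart}}(r) = 1 - \left(1 - p^{r}\right)^{n}

% Without replication (r = 1):
P_{\text{restart}}(1) = 1 - (1 - p)^{n}

% Illustrative numbers: p = 10^{-3}, n = 10^{4}
P_{\text{restart}}(1) \approx 1 - e^{-10} \approx 0.99995
P_{\text{restart}}(2) \approx 1 - e^{-0.01} \approx 0.01
```

Under these assumed numbers, dual replication lowers the chance of needing a restart from near certainty to about one percent, which is the qualitative effect the statement describes.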