Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)
DOI: 10.1145/2063384.2063443

Evaluating the viability of process replication reliability for exascale systems

Citations: cited by 189 publications (254 citation statements)
References: 30 publications
“…We consider an application that executes for a week when there is neither a fault tolerance mechanism nor any failure. The time required to take a checkpoint and rollback the whole application is 10 minutes (C, R), a consistent order of magnitude for current applications at large scale [5]. We consider that the ratio of the memory that is modified by the Library phase (ρ) is fixed at 0.8 (to vary a single parameter at a time in our simulation), and the overhead due to ABFT is φ = 1.03 (again, typical from production deployments [9]).…”
Section: Validation (mentioning)
confidence: 94%
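The checkpoint parameters quoted in the excerpt above can be put in perspective with a rough calculation. The sketch below is not taken from the citing paper: it applies Young's first-order approximation for the checkpoint period, and the platform MTBF (`assumed_mtbf`) is a hypothetical value chosen only for illustration. The function names are likewise illustrative.

```python
import math

# Failure-free run time quoted in the excerpt: one week, in minutes.
MINUTES_PER_WEEK = 7 * 24 * 60


def young_interval(checkpoint_cost_min: float, mtbf_min: float) -> float:
    """Young's first-order approximation of the optimal checkpoint period."""
    return math.sqrt(2.0 * checkpoint_cost_min * mtbf_min)


def failure_free_overhead(checkpoint_cost_min: float, interval_min: float) -> float:
    """Fraction of extra time spent writing checkpoints when no failure occurs."""
    return checkpoint_cost_min / interval_min


# Checkpoint cost C = 10 minutes, as stated in the excerpt.
C = 10.0
# ASSUMPTION: a platform MTBF of one day. This value is NOT from the source.
assumed_mtbf = 24 * 60.0

interval = young_interval(C, assumed_mtbf)
overhead = failure_free_overhead(C, interval)
checkpoints = MINUTES_PER_WEEK / interval

print(f"checkpoint period: {interval:.1f} min")
print(f"failure-free overhead: {overhead:.1%}")
print(f"checkpoints over one week: {checkpoints:.0f}")
```

Under a shorter assumed MTBF the period shrinks with the square root of the MTBF, so the share of time spent checkpointing grows, which is consistent with the scaling concern raised in the citation statements on this page.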
“…Checkpointing strategies are numerous, ranging from fully coordinated checkpointing [14] to uncoordinated checkpoint and recovery with message logging [15]. Despite a very broad applicability, all these fault tolerance methods suffer from the intrinsic limitation that both protection and recovery generate an I/O workload that grows with failure probability, and becomes unsustainable at large scale [5,6] (even when considering optimizations such as diskless or incremental checkpointing [16]). …”
Section: Related Work (mentioning)
confidence: 99%
“…Replication remains the most transparent and least intrusive technique and can be used at different levels (duplication, triplication or even more). Combined with checkpointing, replication comes with two flavors: process replication [24,25] and group replication [26]. Process replication applies to message-passing applications with communicating processes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Moreover, the most popular programming paradigm for HPC, MPI, assumes all interruptions, including single core failures, are fatal to the entire parallel application [4]. It has been identified that as systems grow, failure rates will reach a level that will render current resiliency models ineffective [5].…”
Section: Introduction (mentioning)
confidence: 99%