Evaluating the viability of process replication reliability for exascale systems

Ferreira, Kurt Brian; Stearley, Jon; Laros, James H.; Oldfield, Ron A.; Pedretti, Kevin; Brightwell, Ron; Riesen, Rolf; Bridges, Patrick G.; Arnold, Dorian

doi:10.1145/2063384.2063443

Cited by 189 publications

(254 citation statements)

References 30 publications

Supporting

Mentioning: 252

Contrasting

Unclassified


“…We consider an application that executes for a week when there is neither a fault tolerance mechanism nor any failure. The time required to take a checkpoint and rollback the whole application is 10 minutes (C, R), a consistent order of magnitude for current applications at large scale [5]. We consider that the ratio of the memory that is modified by the Library phase (ρ) is fixed at 0.8 (to vary a single parameter at a time in our simulation), and the overhead due to ABFT is φ = 1.03 (again, typical from production deployments [9]).…”

Section: Validation (mentioning)

confidence: 94%

“…Checkpointing strategies are numerous, ranging from fully coordinated checkpointing [14] to uncoordinated checkpoint and recovery with message logging [15]. Despite a very broad applicability, all these fault tolerance methods suffer from the intrinsic limitation that both protection and recovery generate an I/O workload that grows with failure probability, and becomes unsustainable at large scale [5,6] (even when considering optimizations such as diskless or incremental checkpointing [16]). …”

Section: Related Work (mentioning)

confidence: 99%

“…Checkpoints generate a significant amount of I/O traffic and often block the progression of the application; in addition, they must be taken more and more often as the MTBF decreases in order to enable steady progress of the application. Analytical projections clearly show that sustaining Exascale computing solely with checkpointing will prove challenging [5,6].…”

Section: Introduction (mentioning)

confidence: 99%
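The statements above note that checkpoints must be taken more often as the MTBF decreases. A minimal sketch of that relationship follows, assuming Young's first-order approximation for the checkpoint interval, which is a standard textbook formula and not the model used by the citing paper. The function names and the sample MTBF values are hypothetical; the 10-minute checkpoint cost is taken from the first quoted statement. The approximation assumes the checkpoint cost is small relative to the MTBF and becomes unreliable as the two approach each other.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF).

    Assumption: this textbook approximation stands in for the more
    detailed models used in the cited literature.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost to checkpoints and rework.

    Checkpoint overhead is C / T; expected rework after a failure is
    roughly T / (2 * MTBF). Only meaningful while C is well below MTBF.
    """
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / interval + interval / (2.0 * mtbf_s)


if __name__ == "__main__":
    checkpoint_cost_s = 10 * 60  # 10 minutes, as in the quoted statement
    for mtbf_hours in (24, 6, 1):  # hypothetical platform MTBF values
        mtbf_s = mtbf_hours * 3600
        interval = young_interval(checkpoint_cost_s, mtbf_s)
        waste = waste_fraction(checkpoint_cost_s, mtbf_s)
        print(f"MTBF {mtbf_hours:>2} h: interval {interval / 60:6.1f} min, "
              f"waste {waste:.2f}")
```

Under these assumptions the interval shrinks and the wasted fraction grows as the MTBF drops, which is the trend the quoted statements describe.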


Composing resilience techniques: ABFT, periodic and incremental checkpointing

Bosilca, Bouteiller, Hérault, et al. 2015. IJNC

Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for the effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.


“…Replication remains the most transparent and least intrusive technique and can be used at different levels (duplication, triplication or even more). Combined with checkpointing, replication comes with two flavors: process replication [24,25] and group replication [26]. Process replication applies to message-passing applications with communicating processes.…”

Section: Related Work (mentioning)

confidence: 99%
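The duplication level mentioned in the statement above can be explored with a toy Monte Carlo model. This is only a sketch under stated assumptions, not the analysis from any of the cited papers: every process has exactly one replica, failures strike uniformly at random among processes that are still alive, failed replicas are never repaired, and the application is interrupted only once both replicas of the same process have failed. All function names are hypothetical.

```python
import random


def failures_until_interrupt(num_pairs: int, rng: random.Random) -> int:
    """Count failures until some replica pair has lost both members.

    Assumption: process p and its replica form pair p // 2, and each
    failure hits one uniformly chosen process that is still alive.
    """
    alive = list(range(2 * num_pairs))
    dead_in_pair = [0] * num_pairs
    failures = 0
    while True:
        victim = alive.pop(rng.randrange(len(alive)))
        failures += 1
        pair = victim // 2
        dead_in_pair[pair] += 1
        if dead_in_pair[pair] == 2:
            # Both copies of one logical process are gone: interrupt.
            return failures


def mean_failures(num_pairs: int, trials: int = 1000, seed: int = 0) -> float:
    """Average number of failures absorbed before an interruption."""
    rng = random.Random(seed)
    total = sum(failures_until_interrupt(num_pairs, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for num_pairs in (10, 100, 1000):  # hypothetical application sizes
        print(num_pairs, "pairs:", mean_failures(num_pairs))
```

In this toy model an unreplicated application is interrupted by its first failure, whereas a duplicated one absorbs at least one failure and at most `num_pairs` failures before the one that interrupts it.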

Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis

Cavelan, Fang, Chien, et al. 2017. Lecture Notes in Computer Science

This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that optimally choosing its value leads to significant gains over the CR approach.


“…Moreover, the most popular programming paradigm for HPC, MPI, assumes all interruptions, including single core failures, are fatal to the entire parallel application [4]. It has been identified that as systems grow, failure rates will reach a level that will render current resiliency models ineffective [5].…”

Section: Introduction (mentioning)

confidence: 99%

Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime

Tock, Mandler, Moreira, et al. 2013. Euro-Par 2013 Parallel Processing

As HPC systems and applications get bigger and more complex, we are approaching an era in which resiliency and run-time elasticity concerns become paramount. We offer a building block for an alternative resiliency approach in which computations will be able to make progress while components fail, in addition to enabling a dynamic set of nodes throughout a computation lifetime. The core of our solution is a hierarchical scalable membership service providing eventual consistency semantics. An attribute replication service is used for hierarchy organization, and is exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra large scales. Resulting middleware is general purpose while exploiting HPC platform unique features and architecture. We have implemented and tested this system on BlueGene/P with Linux, and using worst-case analysis, evaluated the service scalability as effective for up to 1M nodes.

