Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors

Benoît, Anne; Cavelan, Aurélien; Robert, Yves; Sun, Hongyang

doi:10.1109/ipdps.2016.39

Cited by 21 publications

(25 citation statements)

References 21 publications

“…which is consistent with the results obtained in [2,6,7], provided that a reliable silent error detector is available. However, as mentioned previously, such a detector is only known in some application-specific domains.…”

Section: General Process Replication (supporting)

confidence: 91%

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Benoît

Cavelan

Cappello

et al. 2018

Journal of Parallel and Distributed Computing

Self Cite

“…Di et al [12] analyzed a two-level computational pattern, and proved that equal-length checkpointing segments constitute the optimal solution. Benoit et al [3] relied on disk checkpoints to cope with fail-stop failures and used memory checkpoints coupled with error detectors to handle silent data corruptions. They derived first-order approximation formulas for the optimal pattern length as well as the number of memory checkpoints between two disk checkpoints.…”

Section: Checkpointing (mentioning)

confidence: 99%

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows

Benoît

Cavelan

Ciorba

et al. 2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Self Cite

This report combines checkpointing and replication for the reliable execution of linear workflows. While both methods have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in failure-prone environments. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques leads to improved performance.
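As a rough illustration of the kind of quadratic dynamic program the abstract above refers to, the sketch below places checkpoints along a linear chain of tasks. It is deliberately simplified and is not the authors' algorithm: it handles checkpointing only (no replication), assumes exponentially distributed fail-stop failures, and uses a commonly cited closed form for the expected time of a checkpointed segment. The names `expected_segment_time` and `place_checkpoints`, and all parameter values, are hypothetical.

```python
import math


def expected_segment_time(work, ckpt, rec, lam, downtime=0.0):
    """Expected time to run `work` seconds followed by a checkpoint of `ckpt`
    seconds, with failure rate `lam`, recovery cost `rec`, and `downtime`.

    Assumption: exponential fail-stop failures, and every restart pays `rec`
    (including restarts of the first segment).
    """
    return math.exp(lam * rec) * (1.0 / lam + downtime) * math.expm1(lam * (work + ckpt))


def place_checkpoints(weights, ckpt, rec, lam):
    """Return (expected makespan, 1-based indices of tasks followed by a checkpoint).

    best[j] is the minimal expected time to finish tasks 1..j with a
    checkpoint taken right after task j. Two nested loops give O(n^2).
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # Last segment covers tasks i+1..j, then checkpoints.
            segment = expected_segment_time(prefix[j] - prefix[i], ckpt, rec, lam)
            if best[i] + segment < best[j]:
                best[j] = best[i] + segment
                prev[j] = i

    # Walk back through the recorded predecessors to recover the placement.
    checkpoints = []
    j = n
    while j > 0:
        checkpoints.append(j)
        j = prev[j]
    checkpoints.reverse()
    return best[n], checkpoints
```

With a very low failure rate the sketch keeps a single checkpoint at the end of the chain, while a high failure rate pushes it toward checkpointing after every task. Extending the state to also decide on replication per task, as the cited report does, would require its specific cost model, which is not reproduced here.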

“…Checkpointing with rollback recovery [17,23] is the de-facto general-purpose recovery technique in high-performance computing. Finding the optimal checkpointing interval [7,19,21,49] or the optimal recovery method for SPH codes is beyond the scope of this paper.…”

Section: Error Correction (mentioning)

confidence: 99%

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations

Cavelan

Cabezón

Ciorba

2019

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

Self Cite

Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large-scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments, and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on an HPC system. The results show that SPR achieves detection rates of 91-99.9% with no false positives, at an overhead of 1-10%.
