Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Fiala, David; Mueller, Frank; Engelmann, Christian; Ferreira, Kurt Brian; Brightwell, Ron; Riesen, Rolf

doi:10.2172/1081941

Cited by 116 publications

(104 citation statements)

References 9 publications

Citation statement types: Supporting; Mentioning (101); Contrasting; Unclassified

“…If a hard failure occurs it is not straight forward to continue the computation. The default way to handle such faults is a rollback to a previous checkpoint, which will be more and more expensive with increasing parallelism not only because of recomputation but also because of communication [13], [17]-[19]. In addition the communicator has to be re-established with replacement processes, or the application has to be repartitioned and/or load-balanced.…”

Section: A Faults and Failures (mentioning)

confidence: 99%

A High-Level C++ Approach to Manage Local Errors, Asynchrony and Faults in an MPI Application

Engwer, Altenbernd, Dreier, et al. 2018

2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which more or less always results in program termination in case of MPI failures. In particular, a local C++ exception can currently lead to a deadlock due to unfinished communication requests on remote hosts. At the same time, future MPI implementations are expected to include an API to continue computations even after a hard fault (node loss), i.e. the worst possible unexpected behaviour. In this paper we present an approach that adds extended exception propagation support to C++ MPI programs. Our technique allows to propagate local exceptions to remote hosts to avoid deadlocks, and to map MPI failures on remote hosts to local exceptions. A use case of particular interest are asynchronous 'local failure local recovery' resilience approaches. Our prototype implementation uses MPI-3.0 features only. In addition we present a dedicated implementation, which integrates seamlessly with MPI-ULFM, i.e. the most prominent proposal for extending MPI towards fault tolerance. Our implementation is available at https://gitlab.dune-project.org/christi/test-mpi-exceptions.

“…The simplest technique is triple modular redundancy and voting [19], which induces a costly verification. For high-performance scientific applications, process replication (each process is equipped with a replica, and messages are quadruplicated) is proposed in the RedMPI library [20]. Elliot et al [21] combine partial redundancy and checkpointing, and confirm the benefit of dual and triple redundancy.…”

Section: Silent Errors (mentioning)

confidence: 99%
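
The redundancy-and-voting idea in the statement above can be illustrated with a small sketch. It assumes each replica's result is available as a hashable Python value; the function names `majority_vote` and `dual_check` are hypothetical and do not correspond to the RedMPI API or to any code from the cited papers.

```python
from collections import Counter


def majority_vote(replica_results):
    """Return the value produced by a strict majority of replicas.

    With three replicas this is triple modular redundancy: one corrupted
    result is outvoted by the two results that agree.
    """
    value, count = Counter(replica_results).most_common(1)[0]
    if count * 2 <= len(replica_results):
        # No strict majority: the error is detected but cannot be corrected.
        raise RuntimeError("no majority among replicas")
    return value


def dual_check(result_a, result_b):
    """Dual redundancy: a mismatch can be detected but not corrected."""
    return result_a == result_b


# Illustrative use: the third replica returned a corrupted value.
corrected = majority_vote([42, 42, 7])
```

With two replicas a mismatch only signals that something went wrong, whereas three replicas allow the corrupted value to be outvoted.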

A backward/forward recovery approach for the preconditioned conjugate gradient method

Fasi, Langou, Robert, et al. 2016

Journal of Computational Science

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. When a silent error is detected by the verification mechanism, one can roll back to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (and no rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations.
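
The "verify every d iterations, checkpoint every c × d iterations" scheme described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's method: `run_with_verification`, `step`, and `verify` are hypothetical names, the verification is a generic caller-supplied predicate rather than a stability test or ABFT, and the toy computation stands in for an iterative solver. It assumes faults are transient, so re-execution from the checkpoint eventually succeeds.

```python
import copy


def run_with_verification(state, step, verify, n_iters, d, c):
    """Verify every d iterations; checkpoint every c*d iterations.

    On a failed verification, roll back to the last checkpoint and
    re-execute. Iterations after the last multiple of d are not verified.
    """
    checkpoint = (0, copy.deepcopy(state))
    rollbacks = 0
    i = 0
    while i < n_iters:
        state = step(state, i)
        i += 1
        if i % d == 0:
            if not verify(state):
                # Silent error detected: restore iteration count and state.
                i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
                rollbacks += 1
                continue
            if i % (c * d) == 0:
                # Only verified states are ever checkpointed.
                checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


# Toy computation with the invariant total == 2 * count.
faults = {5}  # inject one transient fault the first time iteration 5 runs


def step(s, i):
    total, count = s[0] + 2, s[1] + 1
    if i in faults:
        faults.discard(i)
        total += 1000  # simulated silent corruption
    return (total, count)


def verify(s):
    return s[0] == 2 * s[1]


final_state, n_rollbacks = run_with_verification(
    (0, 0), step, verify, n_iters=12, d=2, c=2
)
```

Because a checkpoint is taken only immediately after a successful verification, a rollback always restores a state that passed the check. The choice of d and c trades verification and checkpoint cost against the amount of work lost per rollback.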

“…A high precision and a high recall indicate both few false-positives and a good detection rate, respectively. In general, detectors that either employ full replication of entire applications [27] or selective replication of parts of an application [10] offer the highest precision and recall. However, they are often prohibitively expensive in terms of additional required computing resources and time.…”

Section: Introduction (mentioning)

confidence: 99%

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations

Cavelan, Cabezón, Ciorba 2019

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on an HPC system. The results show that SPR achieves detection rates of 91-99.9%, no false-positives, at an overhead of 1-10%.
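
The compare-against-a-replica detection step described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SPR implementation: `interpolate` is a plain neighbor average standing in for SPH interpolation, `step_with_replicas` and the `corrupt` fault-injection argument are hypothetical, and the sketch does not model how SPR selects particles or how corruption propagates across time steps.

```python
def interpolate(values, neighbors, i):
    """Toy stand-in for SPH interpolation: average over particle i and its neighbors."""
    idx = [i] + neighbors[i]
    return sum(values[j] for j in idx) / len(idx)


def step_with_replicas(values, neighbors, replicated, corrupt=None):
    """Advance all particles once and re-check the replicated ones.

    `corrupt` is an optional (index, delta) pair that injects a fault into
    the primary result. Returns the new values and the replicated particles
    whose primary result differs from the recomputed one.
    """
    new_values = [interpolate(values, neighbors, i) for i in range(len(values))]
    if corrupt is not None:
        index, delta = corrupt
        new_values[index] += delta  # simulated silent data corruption
    mismatches = [
        i for i in replicated
        if interpolate(values, neighbors, i) != new_values[i]
    ]
    return new_values, mismatches


# Four particles on a ring; particles 0 and 2 are replicated.
values = [1.0, 2.0, 3.0, 4.0]
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
_, detected = step_with_replicas(values, neighbors, [0, 2], corrupt=(2, 0.5))
```

In this single-step toy, a fault injected into a non-replicated particle goes unnoticed, which is why the abstract stresses that every particle should have at least one replicated neighbor.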
