2012
DOI: 10.2172/1081941

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Abstract: Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detectin…

Cited by 116 publications (104 citation statements)
References 9 publications
“…If a hard failure occurs it is not straight forward to continue the computation. The default way to handle such faults is a rollback to a previous checkpoint, which will be more and more expensive with increasing parallelism not only because of recomputation but also because of communication [13], [17]- [19]. In addition the communicator has to be re-established with replacement processes, or the application has to be repartitioned and/or load-balanced.…”
Section: A Faults and Failures (mentioning)
confidence: 99%
“…The simplest technique is triple modular redundancy and voting [19], which induces a costly verification. For high-performance scientific applications, process replication (each process is equipped with a replica, and messages are quadruplicated) is proposed in the RedMPI library [20]. Elliot et al [21] combine partial redundancy and checkpointing, and confirm the benefit of dual and triple redundancy.…”
Section: Silent Errors (mentioning)
confidence: 99%
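
The voting idea behind dual and triple redundancy in the statement above can be illustrated with a minimal sketch. It assumes each replica produces a hashable result value; the function name `majority_vote` and its return shape are hypothetical and do not reflect the RedMPI API or the cited authors' implementation. With three replicas a single corrupted value is outvoted (correction), while with two replicas a mismatch can only be detected.

```python
from collections import Counter


def majority_vote(replica_values):
    """Vote over the results produced by redundant replicas.

    Returns (value, corrected): the majority value, plus a flag that is
    True when at least one replica disagreed and was outvoted.
    Raises ValueError when no strict majority exists, i.e. the
    corruption is detected but cannot be corrected.
    """
    if not replica_values:
        raise ValueError("no replica values to vote on")
    counts = Counter(replica_values)
    value, votes = counts.most_common(1)[0]
    # A strict majority needs more than half of the replicas to agree.
    if votes * 2 <= len(replica_values):
        raise ValueError("replicas disagree and no majority exists")
    return value, votes < len(replica_values)


if __name__ == "__main__":
    # Triple redundancy: one corrupted replica is outvoted.
    print(majority_vote([42, 7, 42]))
    # Dual redundancy: a mismatch is detected but not correctable.
    try:
        majority_vote([42, 7])
    except ValueError as err:
        print("detected:", err)
```

The strict-majority check is what separates detection from correction in this sketch: two replicas suffice to notice a disagreement, but a third is needed to decide which value to keep.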
“…A high precision and a high recall indicate both few false-positives and a good detection rate, respectively. In general, detectors that either employ full replication of entire applications [27] or selective replication of parts of an application [10] offer the highest precision and recall. However, they are often prohibitively expensive in terms of additional required computing resources and time.…”
Section: Introduction (mentioning)
confidence: 99%
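
For readers unfamiliar with the metrics in the statement above, the standard definitions of precision and recall can be sketched as follows. The function name and the example counts are illustrative assumptions, not figures from the cited work; a "positive" here means the detector flagged a silent data corruption.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall for an error detector.

    true_pos:  corruptions that were flagged
    false_pos: clean runs that were wrongly flagged
    false_neg: corruptions that were missed
    """
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    # Guard against division by zero when nothing was flagged or present.
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 90 detected, 10 false alarms, 30 missed.
    print(precision_recall(90, 10, 30))
```

High precision corresponds to few false alarms among the flagged cases, and high recall to few missed corruptions.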