2012
DOI: 10.1109/tdmr.2012.2192736

Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer

Cited by 36 publications (24 citation statements)
References 21 publications
“…Flip-flop soft errors can result in the following outcomes [26], [30], [25], [29], [31]: Vanished - normal termination and output files match error-free runs, Output Mismatch (OMM) - normal termination, but output files are different from error-free runs, Unexpected Termination (UT) - program terminates abnormally, Hang - no termination or output within 2× the nominal execution time, Error Detection (ED) - an employed resilience technique flags an error, but the error is not recovered using a hardware recovery mechanism.…”
Section: A. Reliability Analysis (mentioning)
confidence: 99%
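
The five outcome categories in the statement above can be read as a decision procedure over the result of a single fault-injection run. The following is a minimal sketch of that reading, not the citing authors' tooling: the function `classify`, its parameters, and the order in which the checks are applied are all assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    VANISHED = "Vanished"
    OMM = "Output Mismatch"
    UT = "Unexpected Termination"
    HANG = "Hang"
    ED = "Error Detection"


def classify(exit_ok, output_matches, runtime, nominal_runtime,
             error_flagged=False):
    """Map one injected run to an outcome category.

    All parameter names are hypothetical. `runtime` is the time at which
    the run ended or was killed by the test harness; `error_flagged` means
    a resilience technique reported an error that was not recovered by a
    hardware recovery mechanism. The precedence of the checks is assumed.
    """
    if error_flagged:
        return Outcome.ED
    # Hang: no termination or output within 2x the nominal execution time.
    if runtime > 2 * nominal_runtime:
        return Outcome.HANG
    if not exit_ok:
        return Outcome.UT
    if not output_matches:
        return Outcome.OMM
    return Outcome.VANISHED
```

For example, a run that exits normally within the time limit but writes different output files would be classified as `Outcome.OMM` under these assumptions.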
“…Despite a sizable amount of research into this topic, a consensus seems yet to have been established. A reasonable estimate appears to be that the frequency of SDC type errors is roughly an order of magnitude lower than that of errors leading to node failure [10,18]. On modern day clusters, DRAM memory and CPU caches are almost always protected at the architectural level using some form of error correction.…”
Section: Numerical Experiments (mentioning)
confidence: 99%
“…We use the FIT rates for crashes (DUEs) and SDCs of Michalak et al. [29] for the Roadrunner supercomputer. Michalak et al. obtained these rates via accelerated neutron-beam test.…”
Section: A. The Estimation of Failure Rates (mentioning)
confidence: 99%
“…We use the benchmark FIT rates to calculate and specify the target reliability thresholds which are to be achieved by the App FIT heuristic. For instance, if the crash failure is $2.22 \times 10^{3}$ for 32 GB as given in [29], then for 32 MB program input the crash failure would be 2.22, or for a task argument of 32 KB the crash failure would be $2.22 \times 10^{-3}$. Finally a task's overall failure rates $\lambda_F(T)$ and $\lambda_{SDC}(T)$ are the sums of all its arguments' failure rates, respectively.…”
Section: A. The Estimation of Failure Rates (mentioning)
confidence: 99%
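
The arithmetic in the statement above amounts to scaling a reference FIT rate linearly with data size and then summing over a task's arguments. The sketch below illustrates that calculation under stated assumptions: the function names `scaled_fit` and `task_fit` are hypothetical, and sizes use decimal units (1 GB = 10^9 bytes) so that the factor-of-1000 steps in the quote hold exactly. Only the crash rate quoted in the statement is used; an SDC rate would be passed in the same way through `ref_fit`.

```python
# Reference crash (DUE) FIT rate quoted in the statement: 2.22e3 for 32 GB.
REF_CRASH_FIT = 2.22e3
REF_BYTES = 32e9  # 32 GB in decimal units (assumption, see lead-in)


def scaled_fit(size_bytes, ref_fit=REF_CRASH_FIT, ref_bytes=REF_BYTES):
    """Scale a reference FIT rate linearly with the size of the data."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return ref_fit * size_bytes / ref_bytes


def task_fit(argument_sizes, ref_fit=REF_CRASH_FIT, ref_bytes=REF_BYTES):
    """Overall task failure rate: the sum of its arguments' scaled rates."""
    return sum(scaled_fit(s, ref_fit, ref_bytes) for s in argument_sizes)


if __name__ == "__main__":
    print(scaled_fit(32e6))        # 32 MB program input
    print(scaled_fit(32e3))        # 32 KB task argument
    print(task_fit([32e6, 32e3]))  # task with both arguments
```

Up to floating-point rounding, the first two calls reproduce the values 2.22 and 2.22 × 10^-3 given in the quote.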