SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00011

FlipTracker: Understanding Natural Error Resilience in HPC Applications

Abstract: As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few studies have shown the code patterns, that is, combinations or sequences of …

Cited by 22 publications (9 citation statements)
References 51 publications
“…For (3), we use crash tests, but we can avoid them by an application characterization study. In particular, we can detect computation patterns that tolerate computation inaccuracy as in [25]. Then we set up a model to correlate those patterns and application recomputability.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…For (2), when the application outcome is different from that of the golden run, the users can claim a silent data corruption (SDC) happens [25,70]. With the acceptance verification, many applications treat this kind of SDC as benign and ignorable.…”
Section: Discussionmentioning
confidence: 99%
“…These crucial facts lead to increasing importance of and challenges for developing efficient and effective fault tolerance designs for scaling HPC systems [4], [5]. There are numerous fault tolerance techniques proposed to protect MPI application execution from system failures.…”
Section: Introduction (citation type: mentioning)
confidence: 99%