FlipTracker: Understanding Natural Error Resilience in HPC Applications

Guo, Luanzheng; Liu, Dong; Laguna, Ignacio; Schulz, Martin

doi:10.1109/sc.2018.00011

Cited by 22 publications

(9 citation statements)

References 51 publications


“…For (3), we use crash tests, but we can avoid them by an application characterization study. In particular, we can detect computation patterns that tolerate computation inaccuracy as in [25]. Then we set up a model to correlate those patterns and application recomputability.…”

Section: Discussion (mentioning)

confidence: 99%

“…For (2), when the application outcome is different from that of the golden run, the users can claim a silent data corruption (SDC) happens [25,70]. With the acceptance verification, many applications treat this kind of SDC as benign and ignorable.…”

Section: Discussion (mentioning)

confidence: 99%

“…The times when the execution is stopped follow a discrete uniform distribution. This method of interrupting applications is common in the research on system fault tolerance [10,24,25,39,70].…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…Persisting data objects in each code region ensures that the most recent computation results in a phase are persistent in NVM, and can effectively improve application recomputability. The similar definition of code regions can be found in [25] to study application resilience to errors.…”

Section: Maximum Recomputability Of the Code Region K After Persistin... (mentioning)

confidence: 99%


Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures

Ren, Liu (2020)

2020 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite


Emerging non-volatile memory (NVM) is promising for building future HPC. Leveraging the non-volatility of NVM as main memory, we can restart the application using data objects remaining on NVM when the application crashes. This paper explores this solution to handle HPC under failures, based on the observation that many HPC applications have good enough intrinsic fault tolerance. To improve the possibility of successful recomputation with correct outcomes and ignorable performance loss, we introduce EasyCrash, a framework to decide how to selectively persist application data objects during application execution. Our evaluation shows that EasyCrash transforms 54% of crashes that cannot correctly recompute into the correct computation while incurring a negligible performance overhead (1.5% on average). Using EasyCrash and application intrinsic fault tolerance, 82% of crashes can successfully recompute. When EasyCrash is used with a traditional checkpoint scheme, it enables up to 24% improvement (15% on average) in system efficiency.


“…These crucial facts lead to increasing importance of and challenges for developing efficient and effective fault tolerance designs for scaling HPC systems [4], [5]. There are numerous fault tolerance techniques proposed to protect MPI application execution from system failures.…”

Section: Introduction (mentioning)

confidence: 99%

MATCH: An MPI Fault Tolerance Benchmark Suite

Guo, Georgakoudis, Parasyris, et al. (2020)

2020 IEEE International Symposium on Workload Characterization (IISWC)

Self Cite


MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
