2018
DOI: 10.14529/jsfi180102

Record-and-Replay Techniques for HPC Systems: A Survey

Abstract: Record-and-replay techniques provide the ability to record executions of nondeterministic applications and re-execute them identically. These techniques find use in the contexts of debugging, reproducibility, and fault-tolerance, especially in the presence of nondeterministic factors such as message races. Record-and-replay techniques are highly diverse in terms of the fidelity of replay they provide, the assumptions they make about the recorded application, the programming models they target, and the runtime …

Cited by 3 publications (5 citation statements)
References 57 publications
“…In production HPC environments, record-and-replay tools allow users to record a nondeterministic application's execution and then replay it exactly, thus enabling the reproducibility of nondeterministic bugs (Chapp et al, 2018). State-of-the-art record-and-replay tools such as ReMPI (Sato et al, 2015) target production-scale runs and prioritize scalability in terms of runtime and record size.…”
Section: Software Solutions For Nondeterministic Executions (mentioning)
confidence: 99%
“…The convergence of extreme hardware concurrency and the effective overlap of computation and communication in asynchronous executions are resulting in growing nondeterminism in High-Performance Computing (HPC) applications, as illustrated in Figure 1 and presented in Ahn et al (2013); Gopalakrishnan et al (2017); Sato et al (2017); Chapp et al (2015, 2018, 2021). Nondeterminism can manifest at multiple levels in the software stack: it can manifest in low-level communication primitives (e.g., the inherent nondeterminism of nonblocking matching functions in MPI); it can manifest in libraries (e.g., dynamic load-balancing libraries), as presented in Lusk et al (2015); or it can display at the application level (e.g., Monte-Carlo simulations).…”
Section: Introduction (mentioning)
confidence: 99%
“…While capture of intermediate data products can potentially bolster efforts to achieve reproducibility, doing so necessarily comes at the cost of scalability, especially in the HPC setting. Efforts to achieve scalable record and replay of HPC applications indicate that capturing fine-grained data about the intermediate state of parallel executions remains an active and challenging area of research [13]. Hence, in our view the feasibility of Recommendation 4-1's guidelines regarding capture of intermediate data must be managed on a case-by-case basis.…”
Section: The NASEM Recommendations In the Context Of The A4MD Workflow (mentioning)
confidence: 99%
“…For multiple parallel runtimes, to cope with evolving HPC system architectures, the use of multiple parallel runtimes (e.g., MPI + OpenMP) in a single codebase has become increasingly common in scientific computing. The effect of mixing these runtimes on application-level nondeterminism has been identified as a major challenge in the push to exascale [26], and the scarcity of tools for mitigating non-determinism in these types of codebases has been documented [13].…”
Section: Recommendation 5-1: Broadening Notions Of Uncertainty Quanti… (mentioning)
confidence: 99%