2010
DOI: 10.1007/978-3-642-15646-5_23

Checkpoint/Restart-Enabled Parallel Debugging

Abstract: Debugging is often the most time-consuming part of software development. HPC applications prolong the debugging process by adding more processes that interact in dynamic ways for longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process and saves developers considerable time, but it requires parallel debuggers to cooperate with MPI implementations and checkpointers. This paper presents …

Citations: cited by 8 publications (5 citation statements)
References: 16 publications

“…Hursey et al [15] discussed creating intermediate checkpoints, so as to facilitate going back to earlier points in time in order to analyze a bug. This is similar to phase 1 of our three-phase debugging scenario, except that we also assume that a bug manifests in a crash, early termination, or a hanging process.…”
Section: Related Work (mentioning)
confidence: 99%
“…The xSim project is currently working to extend the performance toolkit to provide support for resilience investigations. Another related area is that of large-scale debugging and diagnosis for parallel HPC applications [1,7]. The challenges are similar in that you must be able to gather data about the distributed application and provide details for diagnosis to identify the cause of the error.…”
Section: Related Work (mentioning)
confidence: 99%
“…In automatic error recovery applications, memory checkpointing enables fast and safe recovery to known and stable program states [20,22,23,32,39,53,54,57,58,62,70]. In debugging applications, it enables users to efficiently navigate through several program states observed during the execution, while empowering advanced debugging techniques such as reverse/replay debugging [27,34,60,61]. Memory checkpointing also serves as a key enabling technology for important first-class programming abstractions like software transactional memory [39], application-level backtracking [11,76], and periodic memory rejuvenation [68].…”
Section: Introduction (mentioning)
confidence: 99%