Checkpoint/Restart-Enabled Parallel Debugging

Hursey, Joshua; January, Chris; O'Connor, Mark; Hargrove, Paul; Lecomber, David; Squyres, Jeffrey M.; Lumsdaine, Andrew

doi:10.1007/978-3-642-15646-5_23

Cited by 8 publications

(5 citation statements)

References 16 publications


“…Hursey et al [15] discussed creating intermediate checkpoints, so as to facilitate going back to earlier points in time in order to analyze a bug. This is similar to phase 1 of our three-phase debugging scenario, except that we also assume that a bug manifests in a crash, early termination, or a hanging process.…”

Section: Related Work (mentioning)

confidence: 99%

Extended Batch Sessions and Three-Phase Debugging

Garg, Cao, Arya, et al. 2016

Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale


Batch environments are notoriously unfriendly because it's not easy to interactively diagnose the health of a job. A job may be terminated without warning when it reaches the end of an allotted runtime slot, or it may terminate even sooner due to an unsuspected bug that occurs only at large scale. Two strategies are proposed that take advantage of DMTCP (Distributed MultiThreaded CheckPointing) for system-level checkpointing. First, we describe a three-phase debugging strategy that permits one to interactively debug long-running MPI applications that were developed for non-interactive batch environments. Second, we review how to use the SLURM resource manager capability to easily implement extended batch sessions that overcome the typical limitation of 24 hours maximum for a single batch job on large HPC resources. We argue for greater use of this lesser-known capability, as a means to remove the necessity for the application-specific checkpointing found in many long-running jobs.

CCS Concepts: Software and its engineering → Checkpoint / restart; Software testing and debugging


“…The xSim project is currently working to extend the performance toolkit to provide support for resilience investigations. Another related area is that of large-scale debugging and diagnosis for parallel HPC applications [1,7]. The challenges are similar in that you must be able to gather data about the distributed application and provide details for diagnosis to identify the cause of the error.…”

Section: Related Work (mentioning)

confidence: 99%

Using Performance Tools to Support Experiments in HPC Resilience

Naughton, Bohm, Engelmann, et al. 2014

Euro-Par 2013: Parallel Processing Workshops


Abstract. The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between "performance tools" and "resilience tools". As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community. In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnosis procedures help to provide context for the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerances.


“…In automatic error recovery applications, memory checkpointing enables fast and safe recovery to known and stable program states [20,22,23,32,39,53,54,57,58,62,70]. In debugging applications, it enables users to efficiently navigate through several program states observed during the execution, while empowering advanced debugging techniques such as reverse/replay debugging [27,34,60,61]. Memory checkpointing also serves as a key enabling technology for important first-class programming abstractions like software transactional memory [39], application-level backtracking [11,76], and periodic memory rejuvenation [68].…”

Section: Introduction (mentioning)

confidence: 99%

Speculative Memory Checkpointing

Vogt, Miraglia, Portokalidis, et al. 2015

Proceedings of the 16th Annual Middleware Conference


High-frequency memory checkpointing is an important technique in several application domains, such as automatic error recovery (where frequent checkpoints allow the system to transparently mask failures) and application debugging (where frequent checkpoints enable fast and accurate time-traveling support). Unfortunately, existing (typically incremental) checkpointing frameworks incur substantial performance overhead in high-frequency memory checkpointing applications, thus discouraging their adoption in practice.

This paper presents Speculative Memory Checkpointing (SMC), a new low-overhead technique for high-frequency memory checkpointing. Our motivating analysis identifies key bottlenecks in existing frameworks and demonstrates that the performance of traditional incremental checkpointing strategies in high-frequency checkpointing scenarios is not optimal. To fill the gap, SMC relies on working set estimation algorithms to eagerly checkpoint the memory pages that belong to the writable working set of the running program and only lazily checkpoint the memory pages that do not. Our experimental results demonstrate that SMC is effective in reducing the performance overhead of prior solutions, is robust to variations in the workload, and incurs modest memory overhead compared to traditional incremental checkpointing.
