Memory exclusion: optimizing the performance of checkpointing systems

Plank, James S.; Chen, Yuqun; Li, Kai; Beck, Micah; Kingsley, Gerry

doi:10.1002/(sici)1097-024x(199902)29:2<125::aid-spe224>3.0.co;2-7

Cited by 64 publications

(48 citation statements)

References 26 publications

(43 reference statements)


“…The second set of strategies reduce commit latencies by reducing checkpoint sizes. These strategies include memory exclusion [19] and incremental checkpointing [20]-[22]. In Section V, we discuss the potential interplay between these optimizations and checkpoint compression.…”

Section: A Checkpoint Optimizations (mentioning)

confidence: 99%

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

Ibtesham, Arnold, Bridges, et al. 2012

2012 41st International Conference on Parallel Processing


Abstract: The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we show: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) user-level versus system-level checkpointing bears little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact checkpoint compression might have on projected extreme scale systems.
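The abstract refers to a "simple model for checkpoint compression viability" without stating it. As a rough illustration only, and not necessarily the model used in the paper, one plausible form compares the time to commit an uncompressed checkpoint against the time to compress it and then commit the smaller result. The function name, parameter names, and the inequality itself are assumptions made for this sketch.

```python
def compression_is_viable(checkpoint_bytes, compression_factor,
                          compress_rate, commit_rate):
    """Hypothetical viability test (an assumption, not the paper's formula).

    checkpoint_bytes:   size of the uncompressed checkpoint
    compression_factor: fraction of the size removed by compression (0..1)
    compress_rate:      compressor throughput, in bytes per second
    commit_rate:        throughput to stable storage, in bytes per second
    """
    # Time to write the checkpoint as-is.
    uncompressed_time = checkpoint_bytes / commit_rate
    # Time to compress, plus time to write what remains after compression.
    remaining_bytes = (1 - compression_factor) * checkpoint_bytes
    compressed_time = checkpoint_bytes / compress_rate + remaining_bytes / commit_rate
    return compressed_time < uncompressed_time
```

For example, with a 1 GB checkpoint, a compression factor of 0.7, a 500 MB/s compressor, and a 100 MB/s commit path, the uncompressed commit takes 10 s while compress-then-commit takes 2 s + 3 s, so this sketch reports compression as worthwhile.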


“…Hardware checkpointing resources are also rarely exposed to software, and are even less often configurable in terms of their checkpointing granularity, limiting their wider applicability. Finally, checkpoints have limited application visibility and are often overly aggressive in saving more state than is required by the application [9,27].…”

Section: Introduction (mentioning)

confidence: 99%

Static analysis and compiler design for idempotent processing

Kruijf, Sankaralingam, Jha 2012

Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation


Recovery functionality has many applications in computing systems, from speculation recovery in modern microprocessors to fault recovery in high-reliability systems. Modern systems commonly recover using checkpoints. However, checkpoints introduce overheads, add complexity, and often save more state than necessary. This paper develops a novel compiler technique to recover program state without the overheads of explicit checkpoints. The technique breaks programs into idempotent regions (regions that can be freely re-executed), which allows recovery without checkpointed state. Leveraging the property of idempotence, recovery can be obtained by simple re-execution. We develop static analysis techniques to construct these regions and demonstrate low overheads and large region sizes for an LLVM-based implementation. Across a set of diverse benchmark suites, we construct idempotent regions close in size to those that could be obtained with perfect runtime information. Although the resulting code runs more slowly, typical performance overheads are in the range of just 2-12%. The paradigm of executing entire programs as a series of idempotent regions we call idempotent processing, and it has many applications in computer systems. As a concrete example, we demonstrate it applied to the problem of compiler-automated hardware fault recovery. In comparison to two other state-of-the-art techniques, redundant execution and checkpoint-logging, our idempotent processing technique outperforms both by over 15%.


“…The other major direction is to reduce checkpoint overhead, especially the disk I/O time. Latency hiding and memory exclusion are two key techniques [16]. The studies in this category include copy-on-write [9], diskless checkpointing [17], and incremental checkpointing [5,20]. There also exist several optimization techniques that utilize memory paging mechanisms to achieve fast process execution.…”

Section: Related Work (mentioning)

confidence: 99%

“…The effectiveness of FREM requires that the process only access a relatively small portion of its address space within a given time window after a checkpoint. This assumption is justified by two facts in practice: (1) many applications demonstrate good temporal locality in data accesses, and (2) applications using dynamic memory allocation may have a large amount of unused or dead data in their checkpoint image files [16].…”

Section: Main Idea (mentioning)

confidence: 99%
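Several of the statements above name memory exclusion and incremental checkpointing as ways to shrink a checkpoint. The sketch below is a toy illustration of how the two ideas combine, not the implementation from the cited paper: pages the caller marks as dead or unused are skipped entirely (memory exclusion), and of the remaining pages only those that changed since the previous checkpoint are saved (incremental checkpointing). Real systems typically detect changes with page protection or dirty bits; the hash comparison, the `checkpoint_pages` name, and the 4096-byte page size are assumptions made to keep the example self-contained.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def checkpoint_pages(memory, previous_digests, excluded_pages=frozenset()):
    """Return (pages_to_save, digests) for one checkpoint.

    memory:           bytes-like object standing in for the address space
    previous_digests: {page index: digest} from the previous checkpoint
    excluded_pages:   indices the application has declared dead or unused
    """
    pages_to_save = {}
    digests = {}
    page_count = (len(memory) + PAGE_SIZE - 1) // PAGE_SIZE
    for index in range(page_count):
        # Memory exclusion: never save pages whose contents are not needed.
        if index in excluded_pages:
            continue
        page = bytes(memory[index * PAGE_SIZE:(index + 1) * PAGE_SIZE])
        digest = hashlib.sha256(page).digest()
        digests[index] = digest
        # Incremental checkpointing: save only pages that changed.
        if previous_digests.get(index) != digest:
            pages_to_save[index] = page
    return pages_to_save, digests
```

A first call with an empty `previous_digests` saves every non-excluded page; passing the returned digests into the next call saves only the pages modified in between.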