Optimizing Checkpoint Sizes in the C3 System

Marques, Daniel; Bronevetsky, Greg; Fernandes, Rohit; Pingali, Keshav; Stodghill, Paul

doi:10.1109/ipdps.2005.316

Cited by 14 publications

(16 citation statements)

References 14 publications


“…To checkpoint, prevailing software approaches [10,11,15,30,39,46] first impose a barrier on all threads, and then record their states (Figure 3(a)). A second barrier is used to ensure that a thread continues the execution only after all others have checkpointed, to prevent the states they are recording from being modified.…”

Section: Recovery From Global Exceptions (mentioning)

confidence: 99%
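The two-barrier scheme described in the statement above can be sketched as follows. This is a minimal illustration in Python, not code from C3 or from the citing paper: the function name `run_with_checkpoint`, the per-thread integer state, and the "write to a neighbour's slot" step are all assumptions chosen to show why the second barrier matters.

```python
import threading


def run_with_checkpoint(num_threads=4, steps=3):
    """Snapshot shared state with a two-barrier protocol (illustrative only)."""
    state = [0] * num_threads           # shared state, one slot per thread (hypothetical)
    checkpoint = [None] * num_threads   # recorded copy of each slot
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Pre-checkpoint work: each thread updates only its own slot.
        for _ in range(steps):
            state[tid] += tid + 1

        barrier.wait()                  # barrier 1: every thread has stopped computing
        checkpoint[tid] = state[tid]    # record this thread's state
        barrier.wait()                  # barrier 2: nobody resumes until all have recorded

        # Post-checkpoint work touches a neighbour's slot. Without barrier 2
        # this write could land while the neighbour is still recording.
        state[(tid + 1) % num_threads] += 100

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state, checkpoint
```

With the second barrier in place, the recorded values reflect only the work done before the checkpoint, whatever the thread scheduling.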

“…Upon exception, they recover to a prior error-free state and resume the program, losing all work completed since. A plethora of hardware [3,34,37,43] and software [10,11,15,27,30,39,46] approaches, striking trade-offs between complexity and overheads, have been proposed in the literature (Table 1: rows 1, 2). Our qualitative analysis shows that their checkpointing and recovery processes will be too inefficient to handle frequent exceptions.…”

Section: Introduction (mentioning)

confidence: 99%

“…Modern processors execute a sequential program's instructions in parallel, yet handle exceptions efficiently. They exploit the implicit order between the program's
Table 1 rows spilled into the excerpt (column headers not captured):
1 | [3], ReViveI/O [34], ReVive [37], SafetyNet [43] | Yes | Hardware | High | High | No | No | N/A
2 | [10,11,15], C3 [30], [39,46] | User code | Software | High | High | No | No | N/A
3 | DMP [13], RCDC [14], Calvin [24] | No | Hardware | N/A | N/A | N/A | Yes | High
4 | dOS [7], CoreDet [6], Grace [8], DTHREADS [28], Kendo [35] | No | Software | N/A | N/A | N/A | Yes | High
5 | GPRS (this work)…”

Section: Introduction (mentioning)

confidence: 99%


Globally precise-restartable execution of parallel programs

Gupta

Sridharan

Sohi

2014

Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation


Emerging trends in computer design and use are likely to make exceptions, once rare, the norm, especially as the system size grows. Due to exceptions, arising from hardware faults, approximate computing, dynamic resource management, etc., successful and error-free execution of programs may no longer be assured. Yet, designers will want to tolerate the exceptions so that the programs execute completely, efficiently and without external intervention. Modern computers easily handle exceptions in sequential programs, using precise interrupts. But they are ill-equipped to handle exceptions in parallel programs, which are growing in prevalence. In this work we introduce the notion of globally precise-restartable execution of parallel programs, analogous to precise-interruptible execution of sequential programs. We present a software runtime recovery system based on the approach to handle exceptions in suitably-written parallel programs. Qualitative and quantitative analyses show that the proposed system scales with the system size, especially when exceptions are frequent, unlike the conventional checkpoint-and-recovery method.


“…Also, the checkpoints are taken at a time when the application memory footprint is small. Another approach proposed by Marques et al [28] dynamically partitions objects of the program into subheaps in memory. By specifying how the checkpoint mechanism treat objects in different subheaps as always save, never save and once save, they reduce the checkpoint size at runtime.…”

Section: Related Work (mentioning)

confidence: 99%
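The subheap policies named in the statement above (always save, never save, once save) can be illustrated with a small sketch. This is not the C3 heap API: the class `PartitionedHeap`, its method names, and the subheap names are hypothetical, and a plain dictionary stands in for real heap memory.

```python
from enum import Enum


class Policy(Enum):
    ALWAYS_SAVE = "always"
    NEVER_SAVE = "never"
    SAVE_ONCE = "once"


class PartitionedHeap:
    """Toy heap whose objects live in named subheaps with checkpoint policies."""

    def __init__(self):
        self._policy = {}         # subheap name -> Policy
        self._objects = {}        # subheap name -> {object name: value}
        self._saved_once = set()  # SAVE_ONCE subheaps already written out

    def set_policy(self, subheap, policy):
        self._policy[subheap] = policy
        self._objects.setdefault(subheap, {})

    def allocate(self, subheap, name, value):
        self._objects.setdefault(subheap, {})[name] = value

    def checkpoint(self):
        """Return only the subheaps their policies require saving right now."""
        saved = {}
        for subheap, objects in self._objects.items():
            # Assumption: subheaps without an explicit policy are always saved.
            policy = self._policy.get(subheap, Policy.ALWAYS_SAVE)
            if policy is Policy.NEVER_SAVE:
                continue
            if policy is Policy.SAVE_ONCE:
                if subheap in self._saved_once:
                    continue
                self._saved_once.add(subheap)
            saved[subheap] = dict(objects)
        return saved


heap = PartitionedHeap()
heap.set_policy("scratch", Policy.NEVER_SAVE)   # temporaries, recomputed after restart
heap.set_policy("mesh", Policy.SAVE_ONCE)       # read-only after initialization
heap.set_policy("solution", Policy.ALWAYS_SAVE)
heap.allocate("scratch", "tmp", [0.0] * 4)
heap.allocate("mesh", "nodes", [1, 2, 3])
heap.allocate("solution", "x", [0.5, 0.25])

first = heap.checkpoint()
second = heap.checkpoint()
```

In this sketch the first checkpoint holds the `mesh` and `solution` subheaps, while the second holds only `solution`, which is how such policies would shrink later checkpoints.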

Supporting fault-tolerance in streaming grid applications

Zhu

Chen

Agrawal

2008

2008 IEEE International Symposium on Parallel and Distributed Processing


This paper considers the problem of supporting and efficiently implementing fault-tolerance for tightly-coupled and pipelined applications, especially streaming applications, in a grid environment. We provide an alternative to basic checkpointing and use the notion of Light-weight Summary Structure (LSS) to enable efficient failure-recovery. The idea behind LSS is that at certain points during the execution of a processing stage, the state of the program can be summarized by a small amount of memory. This allows us to store copies of LSS for enabling failure-recovery, which causes low overhead fault-tolerance. Our work can be viewed as an optimization and adaptation of the idea of application-level checkpointing to a different execution environment, and for a different class of applications. Our implementation and evaluation of LSS based failure-recovery has been in the context of the GATES (Grid-based AdapTive Execution on Streams) middleware. An observation we use for providing very low overhead support for fault-tolerance is that algorithms analyzing data streams are only allowed to take a single pass over data, which means they only perform approximate processing. Therefore, we believe that in supporting fault-tolerant execution for these applications, it is acceptable to not analyze a small number of packets of data during failure-recovery. We show how we perform failure-recovery and also demonstrate how we could use additional buffers to limit data loss during the recovery procedure. We also present an efficient algorithm for allocating a new computation resource for failure-recovery at runtime. We have extensively evaluated our implementation using three stream data processing applications, and shown that the use of LSS allows effective and low-overhead failure-recovery.


“…In our approach, a static analysis is done at compile time to compute information that can be fed to the runtime system to reduce the checkpointing overhead. In [21] we describe how we have added functions to our heap implementation that allows heap objects to be partitioned into "colors". There are additional functions for assigning checkpointing policies to each color (e.g., "Never save this color" or "Save this color only once").…”

Section: Automatic Application-level Checkpointing (mentioning)

confidence: 99%

Recent advances in checkpoint/recovery systems

Bronevetsky

Fernandes

Marques

et al.

2006

Proceedings 20th IEEE International Parallel & Distributed Processing Symposium

Self Cite


Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented it, by hand, into their applications. One of the uses of checkpointing is to help mitigate the effects of interruptions in computational service (both planned and unplanned). In fact, some supercomputing centers expect their users to use checkpointing as a matter of policy. And yet, few centers provide fully automatic checkpointing systems for their high-end production machines. The paper is a status report on our work on the family of C3 systems for (almost) fully automatic checkpointing for scientific applications. To date, we have shown that our techniques can be used for checkpointing sequential, MPI and OpenMP applications written in C, Fortran, and several other languages. A novel aspect of our work is that we have not built a single checkpointing system, rather, we have developed a methodology and a set of techniques that have enabled us to develop a number of systems, each meeting different design goals and efficiency requirements.
