Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Naksinehaboon, Nichamon; Liu, Yudan; Leangsuksun, Chokchai; Nassar, Raja; Păun, Mihaela; Scott, Stephen L.

doi:10.1109/ccgrid.2008.109

Cited by 64 publications

(52 citation statements)

References 12 publications

“…Incremental checkpoints reduce the number of full checkpoints taken by periodically saving changes in the application data [49], [50], [51]. These approaches are orthogonal to multilevel checkpoints and can be used in combination with our work.…”

Section: Related Work (mentioning)

confidence: 99%

“…These approaches are orthogonal to multilevel checkpoints and can be used in combination with our work. The checkpoint and rollback technique [51] has been widely used in distributed systems. High availability can be offered by using it and suitable failover algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore, the overheads presented due to checkpointing should need to be reduced. Much of the previous work [51] present measurements of checkpoint latency and overhead for a few applications. Several models that define the optimal checkpoint interval have been proposed by different researchers.…”

Section: Related Work (mentioning)

confidence: 99%

IMCLA: Performance Evaluation of Integrated Multilevel Checkpointing Algorithms using Checkpointing Efficiency

Singh, Chhabra, Singh

2013

Int. J. Com. Dig. Sys.

The main objective of this research is to improve the checkpoint efficiency of integrated multilevel checkpointing algorithms (IMLCA) and to prevent checkpointing from becoming the bottleneck of cloud data centers. To find an efficient checkpoint interval, checkpointing overheads are also considered in this paper. Traditional checkpointing methods persistently store snapshots of the current job state and use them to resume execution at a later time. This research focuses on strategies for deciding when and whether a checkpoint should be taken, and evaluates them with regard to minimizing the induced monetary costs. By varying the rerun time of checkpoints, performance comparisons are made, which are used to evaluate the optimal checkpoint interval. The proposed fail-over strategy works at the application layer and provides high availability for the Platform as a Service (PaaS) feature of cloud computing.

“…If there is something wrong in the checkpoint image or in the checkpoint stage, the application cannot recover. To avoid this, we could take more backup checkpoints (full checkpoint) after several continuous incremental checkpoints, as Naksinehaboon et al [17] do.…”

Section: Design and Implementation of AG-ckpt (mentioning)

confidence: 99%

Understanding the Impact of BPRAM on Incremental Checkpoint

Lü, Wang, et al. 2013

IEICE Trans. Inf. & Syst.

SUMMARY: Existing large-scale systems suffer from various hardware/software failures, motivating research into fault-tolerance techniques. Checkpoint-restart techniques are widely applied fault-tolerance approaches, especially in scientific computing systems. However, checkpoint overhead strongly influences overall system performance. Recently, emerging byte-addressable, persistent memory technologies, such as phase change memory (PCM), have made it possible to implement checkpointing at arbitrary data granularity. However, the impact of data granularity on checkpointing cost has not been fully addressed. In this paper, we investigate how data granularity influences the performance of a checkpoint system. Further, we design and implement a high-performance checkpoint system named AG-ckpt. AG-ckpt is a hybrid-granularity incremental checkpointing scheme based on (1) low-cost modified-memory detection and (2) fine-grained memory duplication. Moreover, we formulate the performance-granularity relationship of checkpointing systems through a mathematical model and obtain the optimum solutions. We conduct experiments on several typical benchmarks to verify the performance gain of our design. Compared to conventional incremental checkpointing, our results show that AG-ckpt can reduce the amount of checkpoint data by up to 50% and provide a speedup of 1.2x-1.3x in checkpoint efficiency.

“…The table shows that (1) the cost caused by restart from one full plus one incremental checkpoints (which is $R_{f+1} - R_f$) is low, compared to the savings by replacing full checkpoints with incremental ones (which is $O_f - O_i$), and can be ignored for most of the benchmarks; (2) the restart cost is nearly proportional to the file size (except that some pages are checkpointed twice at both full and incremental checkpoints but later only restored once and thus lead to no extra cost); (3) for all the benchmarks, we can benefit from the hybrid full/incremental C/R mechanism, and the performance improvement depends on the memory access characteristics of the application. Naksinehaboon et al provide a model that aims at reducing full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints [15]. They further develop a method to determine the optimal number of incremental checkpoints between full checkpoints.…”

Section: Benefits of the Hybrid C/R Mechanism (mentioning)

confidence: 99%
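
The trade-off described in the citation statement above can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the cited authors' model: the function name `net_saving`, the `n_replaced` and `n_restarts` counts, and every numeric value are hypothetical. Only the four quantities (checkpoint overheads O_f and O_i, restart costs R_f and R_{f+1}) come from the quoted text.

```python
def net_saving(o_full, o_incr, r_full, r_full_plus_one,
               n_replaced=1, n_restarts=1):
    """Net time saved by replacing full checkpoints with incremental ones.

    o_full, o_incr: overhead of one full / one incremental checkpoint
        (O_f and O_i in the quoted statement).
    r_full, r_full_plus_one: restart cost from a full checkpoint alone (R_f)
        and from one full plus one incremental checkpoint (R_{f+1}).
    n_replaced, n_restarts: hypothetical counts, assumptions of this sketch.
    """
    # Saving from each full checkpoint replaced by an incremental one.
    saving = n_replaced * (o_full - o_incr)
    # Extra cost paid each time a restart must also apply the increment.
    extra_restart = n_restarts * (r_full_plus_one - r_full)
    return saving - extra_restart


# Hypothetical values in seconds, chosen only for illustration.
print(net_saving(o_full=10.0, o_incr=2.0, r_full=5.0, r_full_plus_one=5.5))
```

A positive result means that, under these assumed numbers, the hybrid full/incremental scheme pays off; a negative result means the added restart cost outweighs the checkpoint savings.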