2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) 2008
DOI: 10.1109/ccgrid.2008.109

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Abstract: For full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved in a reliable storage. As such, the time taken to checkpoint becomes a critical issue which directly impacts the total execution time. Therefore, incremental checkpoint as a less intrusive method to reduce the waste time has been gaining significant attentions in the HPC community. In this paper, we built a model that aims to reduce full checkpoint overhead by performing a set of …

Citations: cited by 64 publications (52 citation statements)
References: 12 publications
“…Incremental checkpoints reduce the number of full checkpoints taken by periodically saving changes in the application data [49], [50], [51]. These approaches are orthogonal to multilevel checkpoints and can be used in combination with our work.…”
Section: Related Work (mentioning)
confidence: 99%
“…These approaches are orthogonal to multilevel checkpoints and can be used in combination with our work. The checkpoint and rollback technique [51] has been widely used in distributed systems. High availability can be offered by using it and suitable failover algorithms.…”
Section: Related Work (mentioning)
confidence: 99%
“…If there is something wrong in the checkpoint image or in the checkpoint stage, the application cannot recover. To avoid this, we could take more backup checkpoints (full checkpoint) after several continuous incremental checkpoints, as Naksinehaboon et al [17] do.…”
Section: Design and Implementation of AG-ckpt (mentioning)
confidence: 99%
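The policy mentioned in the statement above, taking a fresh full checkpoint after several consecutive incremental ones, can be sketched roughly as follows. This is a minimal illustration, not the method of any cited paper: the `Checkpointer` class, the `max_incremental` knob, and the page-dictionary representation of memory are all hypothetical, and the sketch assumes a fixed set of pages (no page is ever freed).

```python
from dataclasses import dataclass, field


@dataclass
class Checkpointer:
    # Hypothetical policy knob: how many incremental checkpoints may
    # follow a full checkpoint before the next full one is forced.
    max_incremental: int
    # Current restore chain: [full image, incremental, incremental, ...]
    chain: list = field(default_factory=list)
    # Copy of the pages as of the previous checkpoint, used to find changes.
    _last: dict = field(default_factory=dict)

    def checkpoint(self, pages: dict) -> str:
        """Save a checkpoint and return which kind was taken."""
        incrementals_so_far = len(self.chain) - 1
        if not self.chain or incrementals_so_far >= self.max_incremental:
            # Start a new chain with a complete copy of all pages.
            self.chain = [dict(pages)]
            kind = "full"
        else:
            # Save only the pages whose contents changed since last time.
            delta = {k: v for k, v in pages.items() if self._last.get(k) != v}
            self.chain.append(delta)
            kind = "incremental"
        self._last = dict(pages)
        return kind

    def restore(self) -> dict:
        """Rebuild state by replaying the full image, then each incremental."""
        state = {}
        for image in self.chain:
            state.update(image)
        return state
```

Bounding the chain length this way limits how many images a restart has to replay, and how many images a single damaged file can invalidate. It does not by itself detect a corrupted image.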
“…The table shows that (1) the cost caused by restart from one full plus one incremental checkpoints (which is $R_{f+1} - R_f$) is low, compared to the savings by replacing full checkpoints with incremental ones (which is $O_f - O_i$), and can be ignored for most of the benchmarks; (2) the restart cost is nearly proportional to the file size (except that some pages are checkpointed twice at both full and incremental checkpoints but later only restored once and thus lead to no extra cost); (3) for all the benchmarks, we can benefit from the hybrid full/incremental C/R mechanism, and the performance improvement depends on the memory access characteristics of the application. Naksinehaboon et al provide a model that aims at reducing full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints [15]. They further develop a method to determine the optimal number of incremental checkpoints between full checkpoints.…”
Section: Benefits of the Hybrid C/R Mechanism (mentioning)
confidence: 99%
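The trade-off described in the statement above, checkpoint overhead saved versus extra restart cost, can be made concrete with a small calculation. This is an illustrative sketch only: the function name, the linear accounting, and the numbers in the usage line are assumptions, not values or formulas from the cited work. The parameters mirror the symbols in the quoted passage: full and incremental checkpoint overheads, and restart cost from a full checkpoint alone versus a full plus one incremental.

```python
def net_saving(o_full, o_incr, r_full, r_full_plus_one, n_incr, n_restarts):
    """Time saved by replacing n_incr full checkpoints with incremental ones,
    minus the extra restart time paid over n_restarts restarts.

    Assumes (hypothetically) that costs add up linearly and that every
    restart replays exactly one full and one incremental checkpoint.
    """
    saved_on_checkpoints = n_incr * (o_full - o_incr)
    extra_on_restarts = n_restarts * (r_full_plus_one - r_full)
    return saved_on_checkpoints - extra_on_restarts


# Made-up figures in seconds: each incremental saves 50 s of overhead,
# and one restart costs 4 s more than restarting from a full image alone.
print(net_saving(60.0, 10.0, 30.0, 34.0, n_incr=5, n_restarts=1))  # 246.0
```

Under these assumed numbers the checkpoint savings dominate the added restart cost, which is the situation the quoted passage reports for most benchmarks.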