2013
DOI: 10.1007/s00354-013-0302-4

Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Abstract: The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to implement fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storag…

Cited by 20 publications
(13 citation statements)
References 35 publications
“…However, the checkpoint's size is too large. Incremental checkpointing [11,21,22,23,12,24] only saves the modified information as compared to the previous checkpoint. This technique reaches advantages of reducing checkpoint's overhead and checkpoint's size, so it is in widely used in distributed computing.…”
Section: Discontinuous Incremental Checkpointing On Cape
Citation type: mentioning
confidence: 99%
“…This technique reaches advantages of reducing checkpoint's overhead and checkpoint's size, so it is in widely used in distributed computing. Besides, using data compression to reduce checkpoint's size [11,21,24], it is also focus on the techniques that detect modified data but reach the minimum of size. Some typical techniques are using page-based protection to identify the pages in memory that have been modified [11,22,23], using word-level granularity [21,12], using block encoding [22], using user-directed and memory exclusion [11], using live variable analysis [24].…”
Section: Discontinuous Incremental Checkpointing On Cape
Citation type: mentioning
confidence: 99%
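The statements above describe incremental checkpointing as saving only the data modified since the previous checkpoint. A minimal sketch of that idea follows. It is not the method of the cited paper or of any citing work: it swaps in block-level hash comparison for the page-protection and word-level detection schemes the statements mention, and it assumes the application state is a byte string whose length does not change between checkpoints. The names `BLOCK_SIZE`, `block_digests`, `incremental_checkpoint`, and `restore` are hypothetical.

```python
import hashlib

# Assumed block size, chosen to loosely mirror a 4 KiB memory page.
BLOCK_SIZE = 4096


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash each fixed-size block of the state."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(
    data: bytes, prev_digests: list[bytes], block_size: int = BLOCK_SIZE
) -> tuple[dict[int, bytes], list[bytes]]:
    """Return (delta, digests).

    delta maps a block index to the block's bytes, and contains only the
    blocks whose hash differs from the previous checkpoint. With an empty
    prev_digests every block is saved, which yields a full checkpoint.
    """
    digests = block_digests(data, block_size)
    delta = {}
    for idx, digest in enumerate(digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            start = idx * block_size
            delta[idx] = data[start:start + block_size]
    return delta, digests


def restore(
    full: bytes, deltas: list[dict[int, bytes]], block_size: int = BLOCK_SIZE
) -> bytes:
    """Rebuild the latest state from a full checkpoint plus ordered deltas.

    Assumes the state length is the same in every checkpoint.
    """
    state = bytearray(full)
    for delta in deltas:
        for idx, block in delta.items():
            start = idx * block_size
            state[start:start + len(block)] = block
    return bytes(state)
```

In this sketch a checkpoint after a single-byte change stores one block rather than the whole state, at the cost of hashing every block on each checkpoint; page-protection schemes avoid that scan by having the operating system flag written pages.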
“…To further reduce the overheads of the checkpoint recovery process, Cores et al. [28] carry out the study on how to reduce the size of the checkpoint files. Also, for large-scale distributed systems, Wei et al.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…We refer to the literature (e.g. Bautista-Gomez and Cappello, 2015; Berrocal et al, 2015, 2016; Turnbull and Alldrin, 2003) for fault detection and prediction, as well as for other techniques, such as advanced focused checkpointing strategies (Cores et al, 2013; Islam et al, 2013; Kohl et al, 2019; Losada et al, 2019; Plank et al, 1995; Rodríguez et al, 2010; Sancho et al, 2004; Tao et al, 2018), selective identification of parts of models in need of reliability (Bridges et al, 2012; Hoemmen et al, 2011), self-stabilizing iterative solvers (Sao and Vuduc, 2013), data compression techniques (Di and Cappello, 2016), global view resilience (Chien et al, 2015), and resilience in the framework of domain decomposition preconditioners (Rizzi et al, 2018a, 2018b; Sargsyan et al, 2015).…”
Section: Introduction
Citation type: mentioning
confidence: 99%