Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Cores, Iván; Rodríguez, Gabriel; Martín, María J.; González, Patricia; Osorio, Roberto R.

doi:10.1007/s00354-013-0302-4

Cited by 20 publications

(13 citation statements)

References 35 publications

“…However, the checkpoint's size is too large. Incremental checkpointing [11,21,22,23,12,24] only saves the modified information as compared to the previous checkpoint. This technique reaches advantages of reducing checkpoint's overhead and checkpoint's size, so it is in widely used in distributed computing.…”

Section: Discontinuous Incremental Checkpointing on CAPE (mentioning)

confidence: 99%

“…This technique reaches advantages of reducing checkpoint's overhead and checkpoint's size, so it is in widely used in distributed computing. Besides, using data compression to reduce checkpoint's size [11,21,24], it is also focus on the techniques that detect modified data but reach the minimum of size. Some typical techniques are using page-based protection to identify the pages in memory that have been modified [11,22,23], using word-level granularity [21,12], using block encoding [22], using user-directed and memory exclusion [11], using live variable analysis [24].…”

Section: Discontinuous Incremental Checkpointing on CAPE (mentioning)

confidence: 99%

“…Besides, using data compression to reduce checkpoint's size [11,21,24], it is also focus on the techniques that detect modified data but reach the minimum of size. Some typical techniques are using page-based protection to identify the pages in memory that have been modified [11,22,23], using word-level granularity [21,12], using block encoding [22], using user-directed and memory exclusion [11], using live variable analysis [24]. In CAPE, Discontinuous Incremental Checkpointing (DICKPT) is a development based on incremental checkpointing, that contains two kinds of data, register information and modified data of the process.…”

Section: Discontinuous Incremental Checkpointing on CAPE (mentioning)

confidence: 99%
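
The citation statements above describe incremental checkpointing as saving only the data modified since the previous checkpoint. The following is a minimal sketch of that idea, not the DICKPT mechanism of CAPE and not the method of Cores et al. It assumes the process state is available as a fixed-size byte string and detects changes by comparing per-block hashes rather than by page protection. The names `BLOCK_SIZE`, `block_hashes`, `incremental_checkpoint`, and `restore` are hypothetical.

```python
import hashlib

# Hypothetical block size, chosen here to mimic a 4 KiB memory page.
BLOCK_SIZE = 4096


def block_hashes(state: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash every fixed-size block of the state."""
    return [
        hashlib.sha256(state[i:i + block_size]).digest()
        for i in range(0, len(state), block_size)
    ]


def incremental_checkpoint(prev_hashes, state: bytes, block_size: int = BLOCK_SIZE):
    """Return (delta, new_hashes).

    delta maps a block index to the block contents, and holds only the
    blocks whose hash differs from the previous checkpoint.
    """
    new_hashes = block_hashes(state, block_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            delta[idx] = state[start:start + block_size]
    return delta, new_hashes


def restore(base: bytes, delta: dict, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild a state by applying the saved blocks on top of a base state.

    Assumes the state did not shrink between checkpoints.
    """
    buf = bytearray(base)
    for idx, block in sorted(delta.items()):
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

For a 16 KiB state in which a single byte changes, the delta holds one 4 KiB block instead of the full state, which is the size reduction the quoted passages refer to. Hashing trades extra computation for portability; the page-protection and word-level approaches mentioned in the quotes detect writes differently.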

Time-stamp incremental checkpointing and its applying for an optimization of execution model to improve performance of CAPE

Tran, Renault, et al. 2018

IJCAI

CAPE, which stands for Checkpointing-Aided Parallel Execution, is a checkpoint-based approach to automatically translate and execute OpenMP programs on distributed-memory architectures. This approach demonstrates high performance and complete compatibility with OpenMP on distributed-memory systems. In CAPE, checkpointing is one of the main factors affecting system performance, as two versions of CAPE show: the first version, based on complete checkpoints, is too slow compared with the second, based on Discontinuous Incremental Checkpointing. This paper presents an improvement of Discontinuous Incremental Checkpointing and a new execution model for CAPE that uses the new checkpointing techniques. These contributions improve performance and make CAPE more flexible. Summary (translated from Slovenian): An improvement of CAPE, parallel execution guided by redundancy support, is presented.

“…To further reduce the overheads of the checkpoint recovery process, Cores et al. [28] carry out the study on how to reduce the size of the checkpoint files. Also, for large-scale distributed systems, Wei et al.…”

Section: Related Work (mentioning)

confidence: 99%

Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads

Pang, Wang 2014

PLoS ONE

Long-running applications are often subject to failures. Once failures occur, it will lead to unacceptable system overheads. The checkpoint technology is used to reduce the losses in the event of a failure. For the two-level checkpoint recovery scheme used in the long-running tasks, it is unavoidable for the system to periodically transfer huge memory context to a remote stable storage. Therefore, the overheads of setting checkpoints and the re-computing time become a critical issue which directly impacts the system total overheads. Motivated by these concerns, this paper presents a new model by introducing i-checkpoints into the existing two-level checkpoint recovery scheme to deal with the more probable failures with the smaller cost and the faster speed. The proposed scheme is independent of the specific failure distribution type and can be applied to different failure distribution types. We respectively make analyses between the two-level incremental and two-level checkpoint recovery schemes with the Weibull distribution and exponential distribution, both of which fit with the actual failure distribution best. The comparison results show that the total overheads of setting checkpoints, the total re-computing time and the system total overheads in the two-level incremental checkpoint recovery scheme are all significantly smaller than those in the two-level checkpoint recovery scheme. At last, limitations of our study are discussed, and at the same time, open questions and possible future work are given.

“…We refer to the literature (e.g. Bautista-Gomez and Cappello, 2015; Berrocal et al, 2015, 2016; Turnbull and Alldrin, 2003) for fault detection and prediction, as well as for other techniques, such as advanced focused checkpointing strategies (Cores et al, 2013; Islam et al, 2013; Kohl et al, 2019; Losada et al, 2019; Plank et al, 1995; Rodríguez et al, 2010; Sancho et al, 2004; Tao et al, 2018), selective identification of parts of models in need of reliability (Bridges et al, 2012; Hoemmen et al, 2011), self-stabilizing iterative solvers (Sao and Vuduc, 2013), data compression techniques (Di and Cappello, 2016), global view resilience (Chien et al, 2015), and resilience in the framework of domain decomposition preconditioners (Rizzi et al, 2018a, 2018b; Sargsyan et al, 2015).…”

Section: Introduction (mentioning)

confidence: 99%

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Benacchio, Bonaventura, Altenbernd, et al. 2021

The International Journal of High Performance Computing Applica

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
