Portable Application-level Checkpointing for Hybrid MPI-OpenMP Applications

Losada, Nuria Riopérez; Martín, María J.; Rodríguez, Gabriel; González, Patricia

doi:10.1016/j.procs.2016.05.294

Cited by 8 publications

(21 citation statements)

References 5 publications

“…As will be discussed in the present section, the application-level checkpoint and restart (ALCR) mechanism is the most effective mechanism for building software applications that are fault tolerant from the beginning [37][38][39]. However, since it is based on the deliberate insertion of checkpoints into the source code, it requires significant expertise and development effort.…”

Section: Optimum Checkpoint Recommendation (mentioning)

confidence: 99%

“…Checkpoint and rollback/recovery is one of the most widely-used mechanisms for adding fault tolerance to software applications [37][38][39]. It was originally developed for enhancing the reliability of transaction-oriented computer systems (e.g.…”

Section: Transaction-oriented Systems and Optimum Checkpoint Interval (mentioning)

confidence: 99%

“…On the contrary, a "safe copy" (i.e. a checkpoint) of the overall execution state of the application should be taken and saved in a secondary file system that cannot be tampered by failures [37,52]. This safe state can be used for recovering the execution of the program in case of a failure.…”

Section: Transaction-oriented Systems and Optimum Checkpoint Interval (mentioning)

confidence: 99%

“…Unlike its counterparts, it necessitates changes to the source code of the applications in order to define (i) the locations of the checkpoints, (ii) the checkpointing frequency, and (iii) the data that should be checkpointed. Although it requires significant development effort, it is considered the most effective CR approach [37][38][39], as it allows the creation of checkpoints with smaller memory footprints, since the minimum amount of information required for restoring the application state is essentially saved. A great number of tools for implementing ALCR in software applications can be found in the related literature [58].…”

Section: Transaction-oriented Systems and Optimum Checkpoint Interval (mentioning)

confidence: 99%
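The three decisions named in the statement above (where to checkpoint, how often, and which data to save) can be illustrated with a minimal single-process sketch. It is not taken from the cited paper or from any ALCR tool; the names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical and exist only for this illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic replacement of any older checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint file exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_every, path, fail_at=None):
    """Sum 0..total_steps-1, resuming from the last checkpoint if one exists.

    fail_at is a hypothetical knob that simulates a crash before that step.
    """
    # (iii) Data to checkpoint: only the loop index and the accumulator.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        # (i) Location: end of an iteration. (ii) Frequency: every N steps.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Saving only the loop index and accumulator, rather than the whole process image, reflects the smaller-footprint argument made in the statement. A restarted run repeats at most `checkpoint_every - 1` steps.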

“…However, the CR approach has recently become an attractive area of research due to our increasing reliance on long-running multi-process HPC applications. Such applications are characterized by expensive and time-consuming computations, and therefore excessive re-computation should be avoided in case of a failure [37]. For these applications, a distributed CR scheme should be employed, in which the checkpoints of the individual processes that constitute the parallel job should be effectively combined in order to create consistent recovery states of the overall parallel application.…”

Section: Transaction-oriented Systems and Optimum Checkpoint Interval (mentioning)

confidence: 99%

Static Analysis-Based Approaches for Secure Software Development

Siavvas, Gelenbe, Kehagias et al. 2018

Communications in Computer and Information Science

Abstract. Software security is a matter of major concern for software development enterprises that wish to deliver highly secure software products to their customers. Static analysis is considered one of the most effective mechanisms for adding security to software products. The multitude of static analysis tools that are available provide a large number of raw results that may contain security-relevant information, which may be useful for the production of secure software. Several mechanisms that can facilitate the production of both secure and reliable software applications have been proposed over the years. In this paper, two such mechanisms, particularly the vulnerability prediction models (VPMs) and the optimum checkpoint recommendation (OCR) mechanisms, are theoretically examined, while their potential improvement by using static analysis is also investigated. In particular, we review the most significant contributions regarding these mechanisms, identify their most important open issues, and propose directions for future research, emphasizing on the potential adoption of static analysis for addressing the identified open issues. Hence, this paper can act as a reference for researchers that wish to contribute in these subfields, in order to gain solid understanding of the existing solutions and their open issues that require further research.

SDK4ED: a platform for building energy efficient, dependable, and maintainable embedded software

Siavvas, Tsoukalas, Marantos et al. 2024

Autom Softw Eng

Optimum checkpoints for programs with loops

Siavvas, Gelenbe 2019

Simulation Modelling Practice and Theory

Checkpoints are widely used to improve the performance of computer systems and programs in the presence of failures, and significantly reduce the cost of restarting a program each time that it fails. Application level checkpointing has been proposed for programs which may execute on platforms which are prone to failures, and also to reduce the execution time of programs which are prone to internal failures. Thus we develop a mathematical model to estimate the average execution time of a program in the presence of failures, without and with application level checkpointing, and use it to estimate the optimum interval in number of instructions executed between successive checkpoints. The case of programs with loops and nested loops is also discussed. The results are illustrated with several numerical examples.
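To give a feel for what an optimum checkpoint interval looks like numerically, the sketch below uses Young's classic first-order approximation, interval ≈ sqrt(2 × checkpoint cost × mean time between failures). This is a swapped-in, well-known formula and not the model developed in the paper above, which expresses the interval in number of instructions; the function name `young_interval` and the use of seconds as the unit are assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimum checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed unit).
    mtbf_s: mean time between failures, in seconds (assumed unit).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

For example, a 2-second checkpoint and a one-hour mean time between failures give `young_interval(2.0, 3600.0)`, which evaluates to 120 seconds. The approximation assumes the checkpoint cost is small relative to the time between failures.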
