Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers

Gholami, Masoud; Schintke, Florian; Schütt, Thorsten

doi:10.1145/3229710.3229755

Cited by 4 publications

(1 citation statement)

References 17 publications

“…A fault-tolerant mechanism can be implemented at system-level, by modifying the OS kernel or the hardware [6], [18], [19]; user-level, by linking the program to fault-tolerant libraries [20]-[23]; or application-level, by injecting the resilient code directly into the application (i.e., relying on the programmer's domain knowledge or by means of a pre-processor) [8], [9]. Despite some recent relevant attempts [24] to combine system- and user-level checkpointing to minimise the failure overhead, I/O bottleneck remains the main concern of C/R techniques. Diskless checkpointing [11] (and its following enhancements [25], [26]) helped to contain this problem, by encoding and saving the state of the computation into the internal memory of redundant computing nodes, instead of reliable storage.…”

Section: Related Work (mentioning)

confidence: 99%

Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint

Loreti, Artioli, Ciampolini

2024

IEEE Trans. Parallel Distrib. Syst.

The scale of today's High Performance Computing (HPC) systems is the key element behind their impressive performance, as well as the reason for their relatively limited reliability. Over the last decade, specific areas of the HPC research field have addressed the issue at different levels, by enriching the infrastructure, the platforms, or the algorithms with fault tolerance features. In this work, we focus on the pervasive task of computing the solution of a dense, unstructured linear system, and we propose an algorithm-based technique that tolerates multiple faults located anywhere during the parallel computation. In particular, we study ways to boost the performance of the rollback-free recovery, and we provide an extensive evaluation of our technique with respect to other state-of-the-art algorithm-based methods.
