Proceedings of the 47th International Conference on Parallel Processing Companion 2018
DOI: 10.1145/3229710.3229755

Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers

Cited by 4 publications (1 citation statement)
References 17 publications
“…A fault-tolerant mechanism can be implemented at system-level, by modifying the OS kernel or the hardware [6], [18], [19]; user-level, by linking the program to fault-tolerant libraries [20]-[23]; or application-level, by injecting the resilient code directly into the application (i.e., relaying on the programmer's domain knowledge or by means of a pre-processor) [8], [9]. Despite some recent relevant attempts [24] to combine system- and user-level checkpointing to minimise the failure overhead, I/O bottleneck remains the main concern of C/R techniques. Diskless checkpointing [11] (and its following enhancements [25], [26]) helped to contain this problem, by encoding and saving the state of the computation into the internal memory of redundant computing nodes, instead of reliable storage.…”
Section: Related Work (mentioning)
confidence: 99%