2018
DOI: 10.1016/j.jocs.2017.03.024

Multi-level checkpointing and silent error detection for linear workflows

Abstract: We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also …

Citations: cited by 11 publications (1 citation statement)
References: 42 publications (83 reference statements)
“…In some studies, in addition to fail‐stops, silent errors are considered . In , disk checkpointing is combined with memory checkpointing.…”
Section: Related Studies (mentioning)
confidence: 99%