2016
DOI: 10.1177/1094342015594531

Efficient checkpoint/verification patterns

Abstract: Errors have become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a…

Cited by 8 publications
(20 citation statements)
References 25 publications
“…Furthermore, we determine the optimal configuration of a partial verification when its cost and recall can be traded off with each other. These results provide important extensions to the classical formulas in the field [27], [11], [4], [3], and to the best of our knowledge, are the first to include partial verifications. Unlike in the classical case, however, a silent error may not be detected by a partial verification and could get propagated to the subsequent work segments inside a pattern, thus significantly complicating the analysis.…”
Section: Introduction (mentioning)
confidence: 79%
“…For example, in the classical protocol for fail-stop errors where verification is not needed, the optimal checkpointing period is known to be √(2μC) as given by Young [27] and Daly [11]. A similar result is also known for silent errors, and the optimal period in that case is √(μ(C + V*)) if only verified checkpoints are used [4], [3]. These formulas provide first-order approximations to the optimal patterns in the respective scenarios, and are valid when the resilient parameters satisfy C, V* ≪ μ.…”
Section: Introduction (mentioning)
confidence: 85%
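
The two first-order formulas quoted in the statement above can be evaluated directly. The following is a minimal sketch, not code from the cited papers. It assumes the conventional reading of the symbols: μ is the platform mean time between failures, C is the cost of one checkpoint, and V* is the cost of one guaranteed verification, all in the same time unit. It takes the silent-error period in its standard first-order form √(μ(C + V*)). The function names and the sample values are hypothetical.

```python
import math


def young_daly_period(mu: float, C: float) -> float:
    """First-order period for fail-stop errors: sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mu * C)


def silent_error_period(mu: float, C: float, V_star: float) -> float:
    """First-order period for silent errors with verified checkpoints:
    sqrt(mu * (C + V_star))."""
    return math.sqrt(mu * (C + V_star))


if __name__ == "__main__":
    # Hypothetical values in seconds, chosen so that C, V* << mu holds,
    # which is the regime where the approximations are stated to be valid.
    mu, C, V_star = 10_000.0, 50.0, 30.0
    print(f"fail-stop period:    {young_daly_period(mu, C):.1f} s")
    print(f"silent-error period: {silent_error_period(mu, C, V_star):.1f} s")
```

Both periods grow with the square root of μ, so quadrupling the time between failures only doubles the amount of work placed between consecutive checkpoints.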