2019
DOI: 10.15803/ijnc.9.1_2

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors

Abstract: Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time …

Cited by 7 publications
(3 citation statements)
References 41 publications
“…Fault tolerant protocols for other parallel programming models, such as PGAS [20] have been also explored. The combination of checkpointing the output of tasks and replicating for application-specific detection is explored in [2] for a linear workflow context, in the presence of both fail-stop and silent faults. Finally, in a recent study, the authors of [21] explore the combination of replication with checkpointing for fail-stop errors, and compute the optimal checkpoint interval for this approach.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…SEDAR can detect and recover from all transient faults that cause SDC and TOE (Time Out Errors). Three different ways are provided by SEDAR so it can achieve full silent error coverage: (1) only detection with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery utilizing a single safe application-level checkpoint. Each of these alternatives has particular features and provides a different cost-performance trade-off.…”
Section: Introduction (mentioning)
confidence: 99%