Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2019
DOI: 10.1145/3295500.3356171

Replication is more efficient than you think

Abstract: This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) M as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period T^no_MTTI = √…
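Step (i) of the no-restart strategy can be illustrated numerically. The sketch below is not taken from the paper: it assumes independent, exponentially distributed processor lifetimes, treats a pair as lost once both of its replicas have failed, and treats the application as interrupted when the first pair is lost. The function name `estimate_mtti` and its parameters are hypothetical. It estimates M by Monte Carlo simulation and does not attempt to reproduce the checkpointing period of step (ii), which is truncated in the abstract above.

```python
import random


def estimate_mtti(num_pairs, mtbf, trials=20000, seed=0):
    """Monte Carlo estimate of the Mean Time To Interruption (MTTI).

    Assumptions (not stated in the abstract): processor lifetimes are
    independent and exponential with mean `mtbf`; failed processors are
    never restarted; the application is interrupted as soon as both
    processors of any one pair have failed.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    rate = 1.0 / mtbf          # expovariate takes a rate, not a mean
    total = 0.0
    for _ in range(trials):
        # A pair is lost at the later of its two failures (max);
        # the application stops at the earliest pair loss (min).
        total += min(
            max(rng.expovariate(rate), rng.expovariate(rate))
            for _ in range(num_pairs)
        )
    return total / trials


if __name__ == "__main__":
    for pairs in (1, 10, 100):
        print(pairs, estimate_mtti(pairs, mtbf=10.0))
```

Under these assumptions a single pair has an expected lifetime of 1.5 times the individual MTBF, which gives a quick sanity check for the estimate; the MTTI then shrinks as the number of pairs grows.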

Cited by 9 publications (1 citation statement)
References 38 publications (85 reference statements)
“…The combination of checkpointing the output of tasks and replicating for application-specific detection is explored in [2] for a linear workflow context, in the presence of both fail-stop and silent faults. Finally, in a recent study, the authors of [21] explore the combination of replication with checkpointing for fail-stop errors, and compute the optimal checkpoint interval for this approach.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%