SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00047

Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities

Citations: cited by 12 publications (21 citation statements)
References: 20 publications
“…A key contribution of this paper is a mathematical analysis of the restart strategy, with a closed-form formula for its optimal checkpointing period. We show that the optimal checkpointing period for the restart strategy has the order Θ(µ^{2/3}), instead of the Θ(µ^{1/2}) used in previous works for no-restart as an extension of the Young/Daly formula [11,20,25]. Hence, as the error rate increases, the optimal period becomes much longer than the value that has been used in all previous works (with no-restart).…”
Section: Introduction
confidence: 91%
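
For readers unfamiliar with the Young/Daly formula named in the statement above, its usual first-order form is W = sqrt(2 µ C), which is where the Θ(µ^{1/2}) order comes from. The sketch below is a minimal illustration under that assumption. The function name `young_daly_period` and the parameter `checkpoint_cost` (C) are hypothetical and do not come from the citing paper, and the citing paper's closed-form formula for the restart strategy is not quoted here, so it is not reproduced.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly checkpointing period W = sqrt(2 * mu * C).

    mtbf            -- mean time between failures, mu (assumed in seconds)
    checkpoint_cost -- time to write one checkpoint, C (assumed, same unit)
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


if __name__ == "__main__":
    # Illustrative values only: a one-day MTBF and a 60-second checkpoint.
    period = young_daly_period(mtbf=86_400.0, checkpoint_cost=60.0)
    print(f"Young/Daly period: {period:.1f} s")
```

Because the period grows with the square root of µ, quadrupling the MTBF only doubles the period in this baseline.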
“…The platform is subject to fail-stop errors, or failures, that interrupt the application. Similarly to previous work [17,20,25], for the mathematical analysis, we assume that errors are independent and identically distributed (IID), and that they strike each processor according to an exponential probability distribution exp(λ) with support [0, ∞), probability density function (PDF) f(t) = λe^{−λt} and cumulative distribution function (CDF) F(T) = P(X ≤ T) = 1 − e^{−λT}. We also introduce the reliability function G(T) = 1 − F(T) = e^{−λT}.…”
Section: Model
confidence: 99%
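
The three functions in the statement above can be written out directly. The following is a minimal sketch, assuming a per-processor error rate λ passed as `lam`; the function names `pdf`, `cdf`, and `reliability` are illustrative and not taken from the citing paper.

```python
import math


def pdf(t: float, lam: float) -> float:
    """Density f(t) = lam * exp(-lam * t) for t >= 0, else 0."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0


def cdf(T: float, lam: float) -> float:
    """F(T) = P(X <= T) = 1 - exp(-lam * T) for T >= 0, else 0."""
    return 1.0 - math.exp(-lam * T) if T >= 0 else 0.0


def reliability(T: float, lam: float) -> float:
    """G(T) = 1 - F(T): probability that no error strikes before time T."""
    return 1.0 - cdf(T, lam)


if __name__ == "__main__":
    lam = 1.0 / 3600.0  # illustrative rate: one error per hour on average
    T = 1800.0          # illustrative horizon: 30 minutes
    print(f"f(T) = {pdf(T, lam):.6e}")
    print(f"F(T) = {cdf(T, lam):.4f}")
    print(f"G(T) = {reliability(T, lam):.4f}")
```

Under this distribution the mean time between errors on one processor is 1/λ, so `lam = 1 / 3600` corresponds to a one-hour MTBF.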