Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing 2020
DOI: 10.1137/1.9781611976137.8

Scalable Resilience Against Node Failures for Communication-Hiding Preconditioned Conjugate Gradient and Conjugate Residual Methods

Abstract: The observed and expected continued growth in the number of nodes in large-scale parallel computers gives rise to two major challenges: global communication operations are becoming major bottlenecks due to their limited scalability, and the likelihood of node failures is increasing. We study an approach for addressing these challenges in the context of solving large sparse linear systems. In particular, we focus on the pipelined preconditioned conjugate gradient (PPCG) method, which has been shown to successfu…

citations
Cited by 5 publications
(2 citation statements)
references
References 44 publications
“…In contrast to restarting, a number of algorithm-based recovery strategies have been proposed, including approximate or heuristic interpolation methods (Agullo et al, 2016a, 2016b). An approach of exactly recovering the state of the iterative solver before the node failure has been investigated for the Preconditioned Conjugate Gradient (PCG) and related methods (Levonyak et al, 2020; Pachajoa et al, 2018). This also includes studying scenarios with multiple simultaneous node failures (Pachajoa et al, 2019) and scenarios where no replacement nodes are available (Pachajoa et al, 2019).…”
Section: Numerical Algorithms For Resilience (mentioning)
confidence: 99%
“…In [16], Levonyak et al extend the concept of ESR to the pipelined PCG algorithm, while maintaining its communication-hiding properties.…”
Section: Related Work (mentioning)
confidence: 99%