2016
DOI: 10.1002/nla.2059

Numerical recovery strategies for parallel resilient Krylov linear solvers

Abstract: As the computational power of high performance computing (HPC) systems continues to increase by using a huge number of cores or specialized processing units, HPC applications are increasingly prone to faults. In this paper, we present a new class of numerical fault tolerance algorithms to cope with node crashes in parallel distributed environments. This new resilient scheme is designed at application level and does not require extra resources, i.e., computational unit or computing time, w…

citations
Cited by 26 publications
(15 citation statements)
references
References 37 publications
“…1.1.3 Notation. We use a notation similar to [2] to denote sections of matrices and vectors. We refer to the set of all indices as I {1, 2, .…”
Section: Problem Setting and Assumptions (mentioning)
confidence: 99%
“…Interpolation-Restart (IR) techniques are designed to cope with node crashes (hard faults) in a parallel distributed environment (Agullo et al, 2015, 2017, 2016a, 2016b). The methods can be designed at the algebraic level for the solution both of linear systems and of eigenvalue problems.…”
Section: Interpolation-restart (mentioning)
confidence: 99%
“…For sparse matrix iterative solvers, such as SOR, GMRES and CG-iterations, the previously mentioned approaches are not readily applicable (Sloan et al, 2012). Suitable extensions are presented in (Agullo et al, 2013; Bridges et al, 2012; Chen, 2013; Roy-Chowdhury and Banerjee, 1993; Stoyanov and Webster, 2013). Cui et al (2017) exploit the mathematical structure of a parallel subspace correction method.…”
Section: Fault Recovery (mentioning)
confidence: 99%