2012
DOI: 10.1007/978-3-642-33078-0_31
The Impact of Global Communication Latency at Extreme Scales on Krylov Methods

Abstract: Krylov Subspace Methods (KSMs) are popular numerical tools for solving large linear systems of equations. We consider their role in solving sparse systems on future massively parallel distributed memory machines, by estimating future performance of their constituent operations. To this end we construct a model that is simple, but which takes topology and network acceleration into account as they are important considerations. We show that, as the number of nodes of a parallel machine increases to very…

Cited by 8 publications (7 citation statements)
References 12 publications (16 reference statements)
“…Convergence is nearly identical to that of standard GMRES, except for a small delay. In Figure 5.1, bottom, the Newton basis is used with the zeros of the th order scaled and shifted (to [1,2]) Chebyshev polynomial as shifts, again in Leja ordering. Convergence is similar to standard GMRES.…”
Section: Numerical Results
confidence: 99%
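The shifts described in the quotation above — zeros of a Chebyshev polynomial scaled and shifted to [1,2], taken in Leja ordering — can be sketched as follows. The quoted passage elides the polynomial order, so degree 6 here is an arbitrary illustrative choice, and `leja_order` is a hypothetical helper implementing the standard greedy Leja procedure, not code from the cited work.

```python
import numpy as np

def chebyshev_zeros(n, a=1.0, b=2.0):
    """Zeros of the degree-n Chebyshev polynomial, mapped from [-1,1] to [a,b]."""
    k = np.arange(n)
    t = np.cos((2 * k + 1) * np.pi / (2 * n))   # zeros on [-1, 1]
    return 0.5 * (a + b) + 0.5 * (b - a) * t    # affine map to [a, b]

def leja_order(points):
    """Greedy Leja ordering: start at the point of largest modulus, then
    repeatedly pick the point maximizing the product of distances to the
    points already chosen."""
    pts = list(points)
    ordered = [max(pts, key=abs)]
    pts.remove(ordered[0])
    while pts:
        nxt = max(pts, key=lambda z: np.prod([abs(z - w) for w in ordered]))
        ordered.append(nxt)
        pts.remove(nxt)
    return np.array(ordered)

shifts = leja_order(chebyshev_zeros(6))
```

Leja ordering matters for the Newton basis because it keeps the partial products of the basis polynomials well scaled, which is what lets the s-step basis stay usable in finite precision.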
“…As the number of nodes increases, the global reductions required in lines 4 and 6 may well become the bottleneck [2,14,5] …”
Section: Standard GMRES
confidence: 99%
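The "lines 4 and 6" of standard GMRES in the quotation above are, plausibly, the Gram-Schmidt inner products and the normalization step of the Arnoldi process — the cited pseudocode is not reproduced here, so that mapping is an assumption. A minimal single-process sketch of Arnoldi with classical Gram-Schmidt, with comments marking where each step would become a global reduction (an MPI_Allreduce) if the vectors were distributed:

```python
import numpy as np

def arnoldi(A, v0, m):
    """Arnoldi process with classical Gram-Schmidt. The two commented steps
    are the ones that become global reductions when rows of the vectors are
    distributed across nodes; the SpMV needs only neighbor communication."""
    n = len(v0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]                    # SpMV: neighbor communication only
        H[:j + 1, j] = V[:, :j + 1].T @ w  # inner products -> global reduction
        w -= V[:, :j + 1] @ H[:j + 1, j]
        H[j + 1, j] = np.linalg.norm(w)    # norm -> a second global reduction
        if H[j + 1, j] > 0:
            V[:, j + 1] = w / H[j + 1, j]
    return V, H
```

Because each iteration's reductions must complete before the next SpMV can start, their latency sits on the critical path — which is why it grows into a bottleneck as the node count increases.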
“…This resulted in speedups and improved scalability on distributed-memory machines. An analogous pipelined version of CG is presented in [18], and the pipelining approach is discussed further in [19]. Another pipelined algorithm, currently implemented in the SLEPc library [20], is the Arnoldi method with delayed reorthogonalization (ADR) [21].…”
Section: Related Work
confidence: 99%
“…This inner product must be completed before p k can be formed, and at least part of p k must be completed before the start of the next iteration computing Ap k . It has been observed that waiting for the two inner products to complete can be very costly when using large numbers of processors [1,5].…”
confidence: 99%
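The dependency chain the quotation above describes can be seen in textbook conjugate gradients: the inner product p·Ap must finish before x and r can be updated, and r·r must finish before the next p can be formed. A minimal sketch (standard CG, not the pipelined variant), with the two per-iteration global reductions marked:

```python
import numpy as np

def cg(A, b, tol=1e-10, maxit=200):
    """Textbook conjugate gradients. On a distributed machine each inner
    product below is a global reduction; both sit on the critical path,
    which is the synchronization cost the quoted passage refers to."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r                        # global reduction
    for _ in range(maxit):
        Ap = A @ p                    # SpMV: neighbor communication only
        alpha = rr / (p @ Ap)         # global reduction; blocks x, r updates
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                # global reduction; blocks the next p
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p     # cannot form p until rr_new arrives
        rr = rr_new
    return x
```

Pipelined variants such as the CG of [18] restructure these recurrences so the reductions can be overlapped with the SpMV instead of serializing each iteration.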