Communication lower bounds and optimal algorithms for numerical linear algebra

Ballard, Grey; Carson, Erin; Demmel, James; Hoemmen, Mark Frederick; Knight, Nicholas; Schwartz, Oded

doi:10.1017/s0962492914000038

Cited by 99 publications (136 citation statements)

References 172 publications (251 reference statements)

Citation types: Supporting; Mentioning (132); Contrasting


“…The strategy proposed in Eqn. (5) for selecting the nodes to keep the redundant copies of $p^{(j-1)}_{I_i}$ and $p^{(j)}_{I_i}$ is a reasonably good heuristic for minimizing communication overheads during SpMV if we assume that the entries of the system matrix $A$ are mostly clustered around the diagonal (since it then is likely that there are some elements which have to be sent anyway from node $i$ to node $d_{ik}$ and, thus, there is no extra latency for establishing a new connection; see Sec. 5 for a more detailed discussion).…”

Section: Tolerating Multiple Node Failures (mentioning)

confidence: 99%

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

Pachajoa, Levonyak, Gansterer, et al. 2019

Proceedings of the 48th International Conference on Parallel Processing


We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation. In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we refine the strategy for how to store redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads between only 2.8% and 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.


“…For example, here is a citation from [5] relevant to our study of pivoting: "The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck".…”

Section: Numerical Gaussian Elimination With No Pivoting and Block Ga… (mentioning)

confidence: 99%

Random multipliers numerically stabilize Gaussian and block Gaussian elimination: Proofs and an extension to low-rank approximation

Pan, Qian, Yan 2015

Linear Algebra and its Applications


We study two applications of standard Gaussian random multipliers. First we prove that with a probability close to 1 such a multiplier is expected to numerically stabilize Gaussian elimination with no pivoting as well as block Gaussian elimination. Then, by extending our analysis, we prove that such a multiplier is also expected to support low-rank approximation of a matrix without customary oversampling. Our test results are in good accordance with this formal study. The results remain similar when we replace Gaussian multipliers with random circulant or Toeplitz multipliers, which involve fewer random parameters and enable faster multiplication. We formally support the observed efficiency of random structured multipliers applied to approximation, but we still continue our research in the case of elimination. We specify a narrow class of unitary inputs for which Gaussian elimination with no pivoting is numerically unstable and then prove that, with a probability close to 1, a Gaussian random circulant multiplier does not fix numerical stability problems for such inputs. We also prove that the power of the random circulant preprocessing increases if we also include random permutations.

Keywords: Gaussian elimination; Pivoting; Block Gaussian elimination; Low-rank approximation; SRFT matrices; Random circulant matrices
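To make the idea in this abstract concrete, the following is a minimal sketch, not the authors' code, of preprocessing a linear system with a Gaussian random multiplier before running Gaussian elimination with no pivoting. It assumes right-multiplication, i.e. it factors $AG = LU$ and recovers $x = Gz$ from $AGz = b$; the function names `genp_lu` and `solve_with_random_multiplier` are hypothetical, and the choice of a dense square multiplier is an assumption made for illustration.

```python
import numpy as np


def genp_lu(a):
    """LU factorization by Gaussian elimination with no pivoting (GENP).

    Returns unit lower triangular L and upper triangular U with L @ U == a
    (up to rounding). No rows are exchanged, so a zero or tiny pivot
    breaks or destabilizes the factorization.
    """
    a = np.array(a, dtype=float)  # work on a copy
    n = a.shape[0]
    lower = np.eye(n)
    for k in range(n - 1):
        # Multipliers for column k: divide by the pivot as-is.
        lower[k + 1:, k] = a[k + 1:, k] / a[k, k]
        # Eliminate the entries below the pivot.
        a[k + 1:, k:] -= np.outer(lower[k + 1:, k], a[k, k:])
    return lower, np.triu(a)


def solve_with_random_multiplier(a, b, rng):
    """Solve a @ x = b via GENP applied to a @ g, with g Gaussian random.

    Assumption for this sketch: the multiplier is applied on the right,
    so we solve (a @ g) @ z = b and return x = g @ z.
    """
    n = a.shape[0]
    g = rng.standard_normal((n, n))
    lower, upper = genp_lu(a @ g)
    y = np.linalg.solve(lower, b)  # forward substitution step
    z = np.linalg.solve(upper, y)  # back substitution step
    return g @ z
```

As a usage example, the matrix `[[0, 1], [1, 1]]` has a zero leading entry, so `genp_lu` applied to it directly divides by zero, whereas `solve_with_random_multiplier` factors the product with a random `g`, whose leading entry is nonzero with probability 1.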


“…Serial and parallel variants of the matrix powers kernel, for both structured and general sparse matrices, are described in [31] and [2], which summarize most of [14] and elaborate on the implementation in [32]. Within [31], we refer the reader to the complexity analysis in Tables 2.3-4, the performance modeling in section 2.6, and the performance results in section 2.10.3 and section 2.11.3, which demonstrate that this optimization leads to speedups in practice.…”

Section: Communication-avoiding Kernels (mentioning)

confidence: 99%

Accuracy of the $s$-Step Lanczos Method for the Symmetric Eigenproblem in Finite Precision

Carson, Demmel 2015

SIAM J. Matrix Anal. & Appl.


The s-step Lanczos method can achieve an O(s) reduction in data movement over the classical Lanczos method for a fixed number of iterations, allowing the potential for significant speedups on modern computers. However, although the s-step Lanczos method is equivalent to the classical Lanczos method in exact arithmetic, it can behave quite differently in finite precision. Increased roundoff errors can manifest as a loss of accuracy or deterioration of convergence relative to the classical method, reducing the potential performance benefits of the s-step approach. In this paper, we present for the first time a complete rounding error analysis of the s-step Lanczos method. Our methodology is analogous to Paige's rounding error analysis for classical Lanczos [IMA J. Appl. Math., 18 (1976), pp. 341-349]. Our analysis gives upper bounds on the loss of normality of and orthogonality between the computed Lanczos vectors, as well as a recurrence for the loss of orthogonality. We further demonstrate that bounds on accuracy for the finite precision Lanczos method given by Paige [Linear Algebra Appl., 34 (1980), pp. 235-258] can be extended to the s-step Lanczos case assuming a bound on the maximum condition number of the precomputed s-step Krylov bases. Our results confirm that the conditioning of the precomputed Krylov bases plays a large role in determining finite precision behavior. In particular, if one can enforce that the condition numbers of the precomputed s-step Krylov bases are not too large in any iteration, then the finite precision behavior of the s-step Lanczos method will be similar to that of classical Lanczos.
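The role of basis conditioning described in this abstract can be illustrated with a small numerical experiment. The sketch below is an illustration only, not the paper's analysis: it assumes the simplest possible choice, an unscaled monomial basis $[v, Av, \dots, A^s v]$, and the helper names `monomial_krylov_basis` and `basis_condition_numbers` are hypothetical. It computes the 2-norm condition number of the basis for several values of $s$.

```python
import numpy as np


def monomial_krylov_basis(a, v, s):
    """Return the n x (s + 1) matrix [v, A v, ..., A^s v].

    Assumption for this sketch: a plain monomial basis with only the
    starting vector normalized. Practical s-step methods may use
    better-conditioned polynomial bases instead.
    """
    cols = [v / np.linalg.norm(v)]
    for _ in range(s):
        cols.append(a @ cols[-1])
    return np.column_stack(cols)


def basis_condition_numbers(a, v, s_values):
    """2-norm condition number of the monomial basis for each s."""
    return [np.linalg.cond(monomial_krylov_basis(a, v, s)) for s in s_values]
```

For instance, calling `basis_condition_numbers(np.diag(np.linspace(1.0, 100.0, 50)), np.ones(50), [2, 4, 8])` returns a nondecreasing sequence, since appending columns to a matrix cannot reduce its condition number. This is one way to see why larger $s$ makes the condition-number assumption in the abstract harder to satisfy.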
