2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI)
DOI: 10.1109/exampi49596.2019.00008

Node-Aware Improvements to Allreduce

Abstract: The MPI_Allreduce collective operation is a core kernel of many parallel codebases, particularly for reductions over a single value per process. The commonly used allreduce recursive-doubling algorithm obtains the lower-bound message count, yielding optimality for small reduction sizes based on node-agnostic performance models. However, this algorithm yields duplicate messages between sets of nodes. Node-aware optimizations in MPICH remove duplicate messages through use of a single master process per node, yiel…
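For context, the following is a minimal sketch of the recursive-doubling pattern the abstract refers to, assuming MPI, a power-of-two communicator size, and a one-value-per-process sum; it is the node-agnostic baseline, not the paper's node-aware variant, and the function name is illustrative.

/* Minimal sketch of recursive-doubling allreduce for a single double,
 * assuming the communicator size is a power of two. Node-agnostic
 * baseline only, not the paper's node-aware algorithm. */
#include <mpi.h>

double recursive_doubling_sum(double value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* At step k, each process exchanges its partial sum with the
     * partner whose rank differs in bit k; after log2(size) steps
     * every process holds the full reduction. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        double recv;
        MPI_Sendrecv(&value, 1, MPI_DOUBLE, partner, 0,
                     &recv,  1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        value += recv;
    }
    return value;
}

Because the rank ^ mask pairing ignores process placement, several processes on one node can exchange partial sums with partners on the same remote node in the same step; these are the duplicate inter-node messages the paper's node-aware approach removes.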

Cited by 13 publications (3 citation statements). References 31 publications.

Citation statements:
“…By communication, we mean the movement of data, both between levels of the memory hierarchy in sequential implementations and between parallel processors in parallel implementations. It is well established that communication and, in particular, synchronization between parallel processors, is the dominant cost (in terms of both time and energy) in large-scale settings; see, e.g., Bienz et al. [6]. It is therefore of interest to understand the potential trade-offs between the numerical properties of loss of orthogonality and stability in finite precision and the cost of communication in terms of number of messages and number of words moved.…”
Section: Block Gram-Schmidt Variants and a Skeleton-Muscle Analogy
Confidence: 99%
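The message/word trade-off named in this passage is conventionally quantified with the postal (alpha-beta) model; a hedged LaTeX sketch, where \alpha is per-message latency, \beta is per-word transfer cost, S is the number of messages, and W the number of words moved (recursive doubling of n words over P processes then costs \log_2(P)(\alpha + \beta n)):

% Postal (alpha-beta) model; symbols as defined in the lead-in.
T_{\mathrm{comm}} = \alpha\,S + \beta\,W,
\qquad
T_{\mathrm{rec.\,doubling}} = \log_2(P)\,\bigl(\alpha + \beta\,n\bigr)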
“…Their parallel strong scaling is limited by the number and frequency of global reductions, in the form of MPI_Allreduce. These communication patterns are expensive [6]. Our new algorithms are designed such that they require only one reduction to normalize each vector and apply projections.…”
Section: Introduction
Confidence: 99%
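One way to realize the single-reduction step this passage describes is a classical Gram-Schmidt variant that packs the projections Q^T v and the squared norm v^T v into one buffer, so a single MPI_Allreduce covers both the projection and the normalization, with the projected vector's norm recovered from the Pythagorean identity. A hedged C sketch under an assumed row-distributed layout (function and variable names are illustrative, not the cited authors' code):

/* Hedged sketch of a one-reduction classical Gram-Schmidt step:
 * project v against k row-distributed orthonormal columns of Q and
 * normalize, using a single MPI_Allreduce. Illustrative only. */
#include <math.h>
#include <mpi.h>

void cgs_one_reduce(const double *Q, /* m_local x k, column-major */
                    double *v,       /* m_local entries, overwritten */
                    int m_local, int k, MPI_Comm comm)
{
    double work[k + 1]; /* packs [Q^T v ; v^T v] for one reduction */

    for (int j = 0; j < k; ++j) {
        work[j] = 0.0;
        for (int i = 0; i < m_local; ++i)
            work[j] += Q[j * m_local + i] * v[i];
    }
    work[k] = 0.0;
    for (int i = 0; i < m_local; ++i)
        work[k] += v[i] * v[i];

    /* The only global synchronization in this step. */
    MPI_Allreduce(MPI_IN_PLACE, work, k + 1, MPI_DOUBLE, MPI_SUM, comm);

    /* Norm via the Pythagorean identity (exact in real arithmetic):
     * ||v - Q Q^T v||^2 = v^T v - ||Q^T v||^2. */
    double nrm2 = work[k];
    for (int j = 0; j < k; ++j)
        nrm2 -= work[j] * work[j];
    double inv_nrm = 1.0 / sqrt(nrm2);

    for (int i = 0; i < m_local; ++i) {
        double proj = 0.0;
        for (int j = 0; j < k; ++j)
            proj += Q[j * m_local + i] * work[j];
        v[i] = (v[i] - proj) * inv_nrm;
    }
}

In finite precision the Pythagorean recovery of the norm can lose accuracy when v lies nearly in the span of Q, which is precisely the stability-versus-communication trade-off raised in the earlier quotation.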
“…Improved architecture-aware performance models, such as the max-rate and node-aware models, have led to the development of methods for improving communication costs. For instance, the drastic performance differences between intra- and inter-node communication motivated node-aware communication optimizations on previous-generation architectures [8]-[10].…”
Section: Introduction
Confidence: 99%
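The node-aware models mentioned here refine the flat postal model by giving intra-node and inter-node traffic separate parameters; an illustrative hedged form (subscript \ell for intra-node, g for inter-node, with \alpha_{\ell} \ll \alpha_{g} on typical clusters), not the exact max-rate formula from the cited works:

% Illustrative node-aware split of the postal model.
T = \alpha_{\ell}\,S_{\ell} + \beta_{\ell}\,W_{\ell}
  + \alpha_{g}\,S_{g} + \beta_{g}\,W_{g},
\qquad \alpha_{\ell} \ll \alpha_{g}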