2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI)
DOI: 10.1109/exampi49596.2019.00008

Node-Aware Improvements to Allreduce

Abstract: The MPI_Allreduce collective operation is a core kernel of many parallel codebases, particularly for reductions over a single value per process. The commonly used allreduce recursive-doubling algorithm obtains the lower-bound message count, yielding optimality for small reduction sizes based on node-agnostic performance models. However, this algorithm yields duplicate messages between sets of nodes. Node-aware optimizations in MPICH remove duplicate messages through use of a single master process per node, yiel…
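For context, the following is a minimal sketch of the recursive-doubling pattern the abstract refers to, assuming MPI, a power-of-two communicator size, and a one-value-per-process sum; it is the node-agnostic baseline, not the paper's node-aware variant, and the function name is illustrative.

/* Minimal sketch of recursive-doubling allreduce for a single double,
 * assuming the communicator size is a power of two. Node-agnostic
 * baseline only, not the paper's node-aware algorithm. */
#include <mpi.h>

double recursive_doubling_sum(double value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* At step k, each process exchanges its partial sum with the
     * partner whose rank differs in bit k; after log2(size) steps
     * every process holds the full reduction. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        double recv;
        MPI_Sendrecv(&value, 1, MPI_DOUBLE, partner, 0,
                     &recv,  1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        value += recv;
    }
    return value;
}

Because the rank ^ mask pairing ignores process placement, several processes on one node can exchange partial sums with partners on the same remote node in the same step; these are the duplicate inter-node messages the paper's node-aware approach removes.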

Cited by 13 publications (3 citation statements). References 31 publications.

Citation statements:
“…By communication, we mean the movement of data, both between levels of the memory hierarchy in sequential implementations and between parallel processors in parallel implementations. It is well established that communication and, in particular, synchronization between parallel processors, is the dominant cost (in terms of both time and energy) in large-scale settings; see, e.g., Bienz et al. [6]. It is therefore of interest to understand the potential trade-offs between the numerical properties of loss of orthogonality and stability in finite precision and the cost of communication in terms of number of messages and number of words moved.…”
Section: Block Gram-Schmidt Variants and a Skeleton-Muscle Analogy
Confidence: 99%
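The message/word trade-off named in this passage is conventionally quantified with the postal (alpha-beta) model; a hedged LaTeX sketch, where \alpha is per-message latency, \beta is per-word transfer cost, S is the number of messages, and W the number of words moved (recursive doubling of n words over P processes then costs \log_2(P)(\alpha + \beta n)):

% Postal (alpha-beta) model; symbols as defined in the lead-in.
T_{\mathrm{comm}} = \alpha\,S + \beta\,W,
\qquad
T_{\mathrm{rec.\,doubling}} = \log_2(P)\,\bigl(\alpha + \beta\,n\bigr)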
“…Their parallel strong scaling is limited by the number and frequency of global reductions, in the form of MPI_Allreduce. These communication patterns are expensive [6]. Our new algorithms are designed such that they require only one reduction to normalize each vector and apply projections.…”
Section: Introduction
Confidence: 99%
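One way to realize the single-reduction step this passage describes is a classical Gram-Schmidt variant that packs the projections Q^T v and the squared norm v^T v into one buffer, so a single MPI_Allreduce covers both the projection and the normalization, with the projected vector's norm recovered from the Pythagorean identity. A hedged C sketch under an assumed row-distributed layout (function and variable names are illustrative, not the cited authors' code):

/* Hedged sketch of a one-reduction classical Gram-Schmidt step:
 * project v against k row-distributed orthonormal columns of Q and
 * normalize, using a single MPI_Allreduce. Illustrative only. */
#include <math.h>
#include <mpi.h>

void cgs_one_reduce(const double *Q, /* m_local x k, column-major */
                    double *v,       /* m_local entries, overwritten */
                    int m_local, int k, MPI_Comm comm)
{
    double work[k + 1]; /* packs [Q^T v ; v^T v] for one reduction */

    for (int j = 0; j < k; ++j) {
        work[j] = 0.0;
        for (int i = 0; i < m_local; ++i)
            work[j] += Q[j * m_local + i] * v[i];
    }
    work[k] = 0.0;
    for (int i = 0; i < m_local; ++i)
        work[k] += v[i] * v[i];

    /* The only global synchronization in this step. */
    MPI_Allreduce(MPI_IN_PLACE, work, k + 1, MPI_DOUBLE, MPI_SUM, comm);

    /* Norm via the Pythagorean identity (exact in real arithmetic):
     * ||v - Q Q^T v||^2 = v^T v - ||Q^T v||^2. */
    double nrm2 = work[k];
    for (int j = 0; j < k; ++j)
        nrm2 -= work[j] * work[j];
    double inv_nrm = 1.0 / sqrt(nrm2);

    for (int i = 0; i < m_local; ++i) {
        double proj = 0.0;
        for (int j = 0; j < k; ++j)
            proj += Q[j * m_local + i] * work[j];
        v[i] = (v[i] - proj) * inv_nrm;
    }
}

In finite precision the Pythagorean recovery of the norm can lose accuracy when v lies nearly in the span of Q, which is precisely the stability-versus-communication trade-off raised in the earlier quotation.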
“…Improved architecture-aware performance models, such as the max-rate and node-aware models, have led to the development of methods for improving communication costs. For instance, the drastic performance differences between intra- and inter-node communication motivated node-aware communication optimizations on previous-generation architectures [8]-[10].…”
Section: Introduction
Confidence: 99%
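The node-aware models mentioned here refine the flat postal model by giving intra-node and inter-node traffic separate parameters; an illustrative hedged form (subscript \ell for intra-node, g for inter-node, with \alpha_{\ell} \ll \alpha_{g} on typical clusters), not the exact max-rate formula from the cited works:

% Illustrative node-aware split of the postal model.
T = \alpha_{\ell}\,S_{\ell} + \beta_{\ell}\,W_{\ell}
  + \alpha_{g}\,S_{g} + \beta_{g}\,W_{g},
\qquad \alpha_{\ell} \ll \alpha_{g}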