Proceedings of the 21st European MPI Users' Group Meeting 2014
DOI: 10.1145/2642769.2642773
GPU-Aware Intranode MPI_Allreduce

Abstract: Modern multi-core clusters are increasingly using GPUs to achieve higher performance and power efficiency. In such clusters, efficient communication among processes with data residing in GPU memory is of paramount importance to the performance of MPI applications. This paper investigates the efficient design of intranode MPI Allreduce operation in GPU clusters. We propose two design alternatives that exploit in-GPU reduction and fast intranode communication capabilities of modern GPUs. Our GPU shared-buffer aw…
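The usage pattern motivating this work is an allreduce invoked directly on device-resident buffers. Below is a minimal sketch, assuming a CUDA-aware MPI build; the buffer name d_vals and the element count are illustrative, not taken from the paper.

```cuda
// Hedged sketch: MPI_Allreduce on a GPU buffer, assuming a CUDA-aware MPI.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int N = 1 << 20;
    double *d_vals;
    cudaMalloc(&d_vals, N * sizeof(double));
    cudaMemset(d_vals, 0, N * sizeof(double));   // stand-in for real data

    // With a CUDA-aware MPI, device pointers can be passed directly;
    // the library stages or pipelines the transfer internally.
    MPI_Allreduce(MPI_IN_PLACE, d_vals, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_vals);
    MPI_Finalize();
    return 0;
}
```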

Cited by 9 publications (4 citation statements)
References 11 publications (23 reference statements)
“…The work in this paper extends our prior study in different ways. While our collective designs in our other work target a single node with a single GPU, in this paper, we extend our work and propose a three‐level hierarchical framework for GPU collectives for clusters with multi‐GPU nodes. The intention of this framework is to highlight the importance of selecting the right algorithm at each hierarchy level in performing the GPU collective operations.…”
Section: Introduction (mentioning)
confidence: 56%
“…Furthermore, as optimal algorithms depend on both message sizes as well as architectural topology, autotuners can determine the best algorithm for various scenarios [27], [28]. Finally, collective algorithms can be optimized for accelerated topologies, such as those containing Xeon Phis [29] and GPUs [30]-[33].…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, performance of MPI_Broadcast is improved by performing a hierarchical operation, using the NVIDIA Collective Communications Library (NCCL) on-node and MPI for all inter-node communication [17]. In addition, CUDA IPC can be utilized to reduce data on the GPU during intra-node MPI_Allreduce operations [18]. Furthermore, algorithms to optimize the performance of CUDA-aware collectives have been explored [19].…”
Section: Introduction (mentioning)
confidence: 99%
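As a rough illustration of the CUDA IPC pattern mentioned in the last statement, the sketch below has one rank export its device buffer via an IPC handle and a leader rank open that buffer and reduce on the GPU. The kernel and the names (add_into, d_buf) are assumptions for illustration, the example expects two ranks on the same node, and it is not the paper's actual implementation.

```cuda
// Hedged sketch: intra-node GPU reduction through CUDA IPC handle sharing.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void add_into(double *dst, const double *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double *d_buf;
    cudaMalloc(&d_buf, N * sizeof(double));
    cudaMemset(d_buf, 0, N * sizeof(double));   // stand-in for real data

    if (rank == 1) {
        // Export this rank's device buffer as an IPC handle and send it to rank 0.
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        // Map the peer's buffer into this process and reduce it on the GPU.
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double *d_peer;
        cudaIpcOpenMemHandle((void **)&d_peer, handle,
                             cudaIpcMemLazyEnablePeerAccess);
        add_into<<<(N + 255) / 256, 256>>>(d_buf, d_peer, N);   // in-GPU reduction
        cudaDeviceSynchronize();
        cudaIpcCloseMemHandle(d_peer);
    }

    // Keep the exported buffer alive until the leader has finished using it.
    MPI_Barrier(MPI_COMM_WORLD);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```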