Proceedings of the 21st European MPI Users' Group Meeting 2014
DOI: 10.1145/2642769.2642773
GPU-Aware Intranode MPI_Allreduce

Abstract: Modern multi-core clusters are increasingly using GPUs to achieve higher performance and power efficiency. In such clusters, efficient communication among processes with data residing in GPU memory is of paramount importance to the performance of MPI applications. This paper investigates the efficient design of intranode MPI Allreduce operation in GPU clusters. We propose two design alternatives that exploit in-GPU reduction and fast intranode communication capabilities of modern GPUs. Our GPU shared-buffer aw…
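The usage pattern motivating this work is an allreduce invoked directly on device-resident buffers. Below is a minimal sketch, assuming a CUDA-aware MPI build; the buffer name d_vals and the element count are illustrative, not taken from the paper.

```cuda
// Hedged sketch: MPI_Allreduce on a GPU buffer, assuming a CUDA-aware MPI.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int N = 1 << 20;
    double *d_vals;
    cudaMalloc(&d_vals, N * sizeof(double));
    cudaMemset(d_vals, 0, N * sizeof(double));   // stand-in for real data

    // With a CUDA-aware MPI, device pointers can be passed directly;
    // the library stages or pipelines the transfer internally.
    MPI_Allreduce(MPI_IN_PLACE, d_vals, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_vals);
    MPI_Finalize();
    return 0;
}
```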

Cited by 9 publications (4 citation statements)
References 11 publications (23 reference statements)
“…The work in this paper extends our prior study in different ways. While our collective designs in our other work target a single node with a single GPU, in this paper, we extend our work and propose a three‐level hierarchical framework for GPU collectives for clusters with multi‐GPU nodes. The intention of this framework is to highlight the importance of selecting the right algorithm at each hierarchy level in performing the GPU collective operations.…”
Section: Introduction (mentioning)
confidence: 56%
“…Furthermore, as optimal algorithms depend on both message sizes as well as architectural topology, autotuners can determine the best algorithm for various scenarios [27], [28]. Finally, collective algorithms can be optimized for accelerated topologies, such as those containing Xeon Phis [29] and GPUs [30]-[33].…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, performance of MPI_Broadcast is improved by performing a hierarchical operation, using the NVIDIA Collective Communications Library (NCCL) on-node and MPI for all inter-node communication [17]. In addition, CUDA IPC can be utilized to reduce data on the GPU during intra-node MPI_Allreduce operations [18]. Furthermore, algorithms to optimize the performance of CUDA-aware collectives have been explored [19].…”
Section: Introduction (mentioning)
confidence: 99%
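As a rough illustration of the CUDA IPC pattern mentioned in the last statement, the sketch below has one rank export its device buffer via an IPC handle and a leader rank open that buffer and reduce on the GPU. The kernel and the names (add_into, d_buf) are assumptions for illustration, the example expects two ranks on the same node, and it is not the paper's actual implementation.

```cuda
// Hedged sketch: intra-node GPU reduction through CUDA IPC handle sharing.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void add_into(double *dst, const double *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double *d_buf;
    cudaMalloc(&d_buf, N * sizeof(double));
    cudaMemset(d_buf, 0, N * sizeof(double));   // stand-in for real data

    if (rank == 1) {
        // Export this rank's device buffer as an IPC handle and send it to rank 0.
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        // Map the peer's buffer into this process and reduce it on the GPU.
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double *d_peer;
        cudaIpcOpenMemHandle((void **)&d_peer, handle,
                             cudaIpcMemLazyEnablePeerAccess);
        add_into<<<(N + 255) / 256, 256>>>(d_buf, d_peer, N);   // in-GPU reduction
        cudaDeviceSynchronize();
        cudaIpcCloseMemHandle(d_peer);
    }

    // Keep the exported buffer alive until the leader has finished using it.
    MPI_Barrier(MPI_COMM_WORLD);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```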