The current wave of advances in Machine Learning (ML) and Deep Learning (DL) has been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and the development of easy-to-use software frameworks such as TensorFlow (TF), Caffe, and Torch. TensorFlow has been, by far, the most widely adopted ML/DL framework. However, little exists in the literature that provides a thorough understanding of the capabilities that TensorFlow offers for distributed training of large ML/DL models requiring computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+'X', where X is InfiniBand Verbs, Message Passing Interface (MPI), or GPUDirect RDMA, and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures (e.g., ResNet vs. MobileNet). Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by gradient aggregation using the Allreduce communication pattern. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits 1) CUDA kernels to perform large reductions on the GPU and 2) a pointer cache to avoid the overhead of queries to the CUDA driver. Our proposed designs have been implemented in MVAPICH2-GDR and offer 5-17× better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages on 16 GPUs (nodes). The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8× and 3.2× higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
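
To make the No-gRPC gradient aggregation pattern concrete, the following is a minimal sketch of a CUDA-aware MPI_Allreduce over a GPU-resident gradient buffer, the core communication step used by Baidu Allreduce and Horovod. It assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) that accepts device pointers directly; the buffer name and gradient count are illustrative only and are not taken from the paper.

/* Sketch: gradient aggregation via CUDA-aware MPI_Allreduce.
 * Assumes a CUDA-aware MPI build (e.g., MVAPICH2-GDR) so that the
 * device buffer can be passed to MPI_Allreduce without an explicit
 * device-to-host copy. Names and sizes are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t num_grads = 1 << 20;   /* illustrative gradient count */
    float *d_grads;                     /* gradients resident on the GPU */
    cudaMalloc((void **)&d_grads, num_grads * sizeof(float));
    /* ... backward pass on each rank fills d_grads ... */

    /* Sum gradients across all ranks in place; a CUDA-aware MPI
     * reduces directly on (or from) the device buffer. */
    MPI_Allreduce(MPI_IN_PLACE, d_grads, (int)num_grads,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Averaging by nranks is typically done by a small CUDA kernel
     * or folded into the optimizer's learning rate. */

    cudaFree(d_grads);
    MPI_Finalize();
    return 0;
}

In the gRPC-based approaches, by contrast, gradient exchange goes through parameter servers over the gRPC channel rather than through a collective Allreduce of this form.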