Proceedings of the 23rd European MPI Users' Group Meeting 2016
DOI: 10.1145/2966884.2966912
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Cited by 42 publications (18 citation statements)
References 10 publications
“…Figure 6 shows the current positioning of Cylon in deep learning integration. To further enhance the distributed operations, we can add specific support to deep learning settings such as NCCL [19].…”
Section: Transport Layer (mentioning)
confidence: 99%
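As a rough illustration of what such NCCL support looks like at the API level, the sketch below broadcasts a GPU-resident buffer with ncclBroadcast, bootstrapping the NCCL communicator over MPI. The one-rank-per-GPU layout, buffer size, and root rank are assumptions for the example, not details from the cited work.

```c
/* Minimal NCCL broadcast sketch: one MPI rank per GPU.
 * Assumes NCCL >= 2.2 (ncclBroadcast) and an MPI launcher. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    cudaSetDevice(rank);              /* assumption: rank i drives GPU i */

    /* Rank 0 creates the NCCL id; MPI distributes it to all ranks. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    const size_t count = 64 * 1024 * 1024;  /* 256 MB of floats (example size) */
    float *buf;
    cudaMalloc(&buf, count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    /* Broadcast the device buffer from rank 0 to all GPUs. */
    ncclBroadcast(buf, buf, count, ncclFloat, 0, comm, stream);
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```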
“…For the large and very large message ranges, we see that NCCL provides scalable performance. At the same time, our proposed pipelined chain designs in MVAPICH2-GDR allow us to achieve similar or better performance, essentially alleviating the need to resort to the NCCL-augmented broadcast designs proposed in [4].…”
Section: B. Intranode Performance Comparison (Micro-benchmark) (mentioning)
confidence: 91%
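The idea behind a pipelined chain broadcast can be sketched with plain CUDA-aware MPI point-to-point calls: the message is split into chunks, and each rank forwards chunk i to its successor while its predecessor is already sending chunk i+1, so transfers overlap along the chain. The helper name chain_bcast, the chunk size, and the rank ordering below are illustrative assumptions, not the tuned MVAPICH2-GDR design.

```c
/* Pipelined chain broadcast sketch over CUDA-aware MPI.
 * Ranks form a chain 0 -> 1 -> ... -> n-1; with a CUDA-aware MPI,
 * buf may be a device pointer. */
#include <mpi.h>

void chain_bcast(float *buf, size_t count, MPI_Comm comm) {
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    const size_t chunk = 1 << 20;          /* 1M floats per chunk (assumption) */
    int prev = rank - 1, next = rank + 1;

    for (size_t off = 0; off < count; off += chunk) {
        size_t n = (count - off < chunk) ? (count - off) : chunk;
        /* Receive this chunk from the predecessor (the root has none). */
        if (prev >= 0)
            MPI_Recv(buf + off, (int)n, MPI_FLOAT, prev, 0, comm,
                     MPI_STATUS_IGNORE);
        /* Forward it down the chain; upstream ranks are already moving the
         * next chunk, which is where the pipelining comes from. */
        if (next < nranks)
            MPI_Send(buf + off, (int)n, MPI_FLOAT, next, 0, comm);
    }
}
```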
“…CUDA-Aware MPI runtimes like MVAPICH2-GDR are flexible enough to integrate third-party libraries like NCCL. In this context, we designed and evaluated NCCL-based MPI_Bcast designs in our earlier work [4]. The hierarchical nature of collective communication in MVAPICH2 allowed us to exploit NCCL for intranode communication along with efficient and tuned designs for internode communication.…”
Section: Limitations of NCCL-Integrated MPI Designs (mentioning)
confidence: 99%
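The hierarchical scheme described above can be sketched as a two-level broadcast: one leader per node takes part in an internode MPI_Bcast, then each leader fans the data out to its local GPUs with NCCL. The helper name hier_bcast, the communicator splitting, and the leader choice below are assumptions for illustration, not the library's internal design.

```c
/* Two-level broadcast sketch: MPI across nodes, NCCL inside each node.
 * Assumes one MPI rank per GPU, a pre-built per-node NCCL communicator
 * whose rank order matches node_rank, and global rank 0 as the root. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void hier_bcast(float *dev_buf, size_t count, MPI_Comm world,
                ncclComm_t node_nccl, cudaStream_t stream) {
    /* Split ranks by node; local rank 0 acts as the node leader. */
    MPI_Comm node_comm, leader_comm;
    int node_rank;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_split(world, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Stage 1: internode broadcast among node leaders (CUDA-aware MPI,
     * so dev_buf can be a device pointer). */
    if (node_rank == 0)
        MPI_Bcast(dev_buf, (int)count, MPI_FLOAT, 0, leader_comm);

    /* Stage 2: intranode fan-out from the leader (NCCL rank 0, by the
     * rank-order assumption above) to the local GPUs. */
    ncclBroadcast(dev_buf, dev_buf, count, ncclFloat, 0, node_nccl, stream);
    cudaStreamSynchronize(stream);

    MPI_Comm_free(&node_comm);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
}
```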
“…For instance, Glaser et al. [6] implement strong-scaling versions of general-purpose molecular dynamics simulations on GPUs, and Lončar et al. use it as well in the aforementioned solver from [10]. Deep learning and data analytics are other fields taking advantage of CUDA-aware MPI implementations, for example by exploiting them to support efficient large-message broadcast operations [1]. That work also exploits NCCL in order to optimize intranode communication among directly connected GPUs.…”
Section: Related Work (mentioning)
confidence: 99%
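What "CUDA-aware" buys in practice is that device pointers can be handed straight to MPI calls, with no staging through host memory. The snippet below is a minimal sketch of that idiom; the buffer size and root rank are arbitrary example values.

```c
/* CUDA-aware MPI sketch: broadcast a device buffer directly.
 * With a CUDA-aware build (e.g. MVAPICH2-GDR, Open MPI with UCX),
 * MPI_Bcast accepts the GPU pointer; no cudaMemcpy staging is needed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);                  /* assumption: one GPU per process */

    const int count = 1 << 24;         /* 16M floats = 64 MB (example size) */
    float *dbuf;
    cudaMalloc(&dbuf, count * sizeof(float));

    /* Device pointer passed directly; a non-CUDA-aware MPI would fail here. */
    MPI_Bcast(dbuf, count, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```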