2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
DOI: 10.1109/ccgrid.2019.00064

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

Abstract: The current wave of advances in Machine Learning (ML) and Deep Learning (DL) has been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and the development of easy-to-use software frameworks like TensorFlow (TF), Caffe, and Torch. TensorFlow has been, by far, the most widely adopted ML/DL framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that n…
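For context on the kind of distributed training the paper studies, the following is a minimal sketch of multi-worker data-parallel training in TensorFlow 2.x using tf.distribute.MultiWorkerMirroredStrategy. The model, synthetic data, and launch setup are illustrative placeholders and not the paper's benchmark configuration; a real multi-node run also requires TF_CONFIG to be set on each worker.

```python
# Minimal multi-worker data-parallel training sketch (TensorFlow 2.x).
# Assumes TF_CONFIG is set per worker; model and data are placeholders.
import tensorflow as tf

# Collective all-reduce across workers; on GPUs TensorFlow can use NCCL underneath.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Synthetic data stands in for a real dataset.
x = tf.random.normal((1024, 784))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(tf.data.Dataset.from_tensor_slices((x, y)).batch(64), epochs=1)
```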


Cited by 31 publications (15 citation statements)
References 21 publications (19 reference statements)
“…NCCL is a set of powerful collective communication primitives for GPU which has already demonstrated accelerated performance for deep learning applications [29], [5], [7], [6], [8]. However, the utilization of NCCL for NMF has been unexplored so far.…”
Section: Rationale for PyDNMF-GPU
confidence: 99%
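To make the reference to NCCL collectives concrete, here is a minimal sketch of an NCCL-backed all-reduce. PyTorch's distributed package is used only because it exposes NCCL collectives conveniently from Python; the script name and launch command are illustrative assumptions, not code from the cited works.

```python
# Illustrative NCCL all-reduce via torch.distributed (backend="nccl").
# Launch with, e.g.: torchrun --nproc_per_node=2 nccl_allreduce.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL collectives on GPUs
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes its rank id; the all-reduce sums across all GPUs.
    t = torch.full((4,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```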
“…There exist a few works that specifically evaluate and/or improve the MPI CCPs for DL, for example, taking into account the special characteristics of the messages that are exchanged in this type of applications [3,4,18,23]. In addition, MPI-based software has been developed for distributed DNN training; for example, MVAPICH2-GDR from Ohio State University or oneAPI from Intel.…”
Section: MPI Collective Communication Primitives
confidence: 99%
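As an illustration of an MPI collective primitive in the role it plays in data-parallel DL (gradient averaging), here is a minimal mpi4py sketch. The array standing in for a layer's gradients and the script name are hypothetical; a CUDA-aware MPI build could pass GPU buffers instead of host NumPy arrays.

```python
# Hypothetical gradient-averaging step using an MPI all-reduce (mpi4py).
# Run with, e.g.: mpirun -np 4 python avg_grad.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Stand-in for the local gradients computed by this rank.
local_grad = np.random.rand(1024).astype(np.float32)
global_grad = np.empty_like(local_grad)

# Sum gradients across all ranks, then divide to obtain the average.
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= comm.Get_size()
```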
“…There exist a number of instances of the MPI library, with some prominent examples being OpenMPI, MPICH, MVAPICH, and Intel MPI. All these implementations adhere to the functionality and specification defined by the MPI API, while distinct realizations of the standard vary in the implementation of the primitives and, quite often, the performance they attain.…”
Section: A Family of Algorithms
confidence: 99%
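A small sketch of the point made above: the same mpi4py program is source-compatible across these MPI implementations, and the active library can be identified at run time. This snippet is illustrative and not taken from the cited work.

```python
# Report which MPI implementation the program is running on.
from mpi4py import MPI

if MPI.COMM_WORLD.Get_rank() == 0:
    print("MPI standard version:", MPI.Get_version())
    print("Library:", MPI.Get_library_version())
```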
“…Several efforts have focused on the investigation of the behavior of deep learning applications from different perspectives: performance and power characteristics [19], scalability and fine-tuning [20], GPU optimizations [21], I/O workloads [22], [23]. However, a systematic understanding of fine-grain behavior at tensor level that explains the interplay of layer-wise pipelining and all-reduce in synchronous data parallel training is missing.…”
Section: B. Horovod
confidence: 99%
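To ground the discussion of all-reduce in synchronous data-parallel training, here is a hedged sketch of a Horovod + TensorFlow 2.x training step in which per-rank gradients are averaged with an all-reduce at every iteration. The model, synthetic data, learning-rate scaling, and launch command are illustrative assumptions, not the setup from the cited study.

```python
# Sketch of synchronous data-parallel training with Horovod on TensorFlow 2.x.
# Run with, e.g.: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Pin each rank to one local GPU.
    tf.config.set_visible_devices(gpus[hvd.local_rank() % len(gpus)], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR with worker count
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    # Wrap the tape so gradients are averaged with an all-reduce across ranks.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Ensure all ranks start from rank 0's initial weights.
        hvd.broadcast_variables(model.variables, root_rank=0)
    return loss

# Synthetic data stands in for a real per-rank shard.
x = tf.random.normal((64, 784))
y = tf.random.uniform((64,), maxval=10, dtype=tf.int32)
for step in range(10):
    train_step(x, y, step == 0)
```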