Bandwidth optimal all-reduce algorithms for clusters of workstations
2009
DOI: 10.1016/j.jpdc.2008.09.002

Abstract: We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound of the amount of data that must be communicated in order to complete this operation and propose a ring-based algorithm that only requires tree connectivity to achieve bandwidth optimality. Unlike the widely used butterfly-like all-reduce algorithm that incurs network contention in SMP/multi-core…
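As a back-of-the-envelope illustration of the bandwidth-optimality claim (my own arithmetic, not taken from the paper): with p processes each holding n data items, a ring all-reduce makes every process send 2n(p-1)/p items in total, split between a reduce-scatter phase and an all-gather phase. This approaches 2n as p grows and matches, up to lower-order terms, the kind of per-node lower bound the abstract refers to.

```python
# Back-of-the-envelope arithmetic (illustration only, not code from the paper):
# per-process communication volume of a ring all-reduce on n items across
# p processes.
def ring_allreduce_traffic(n, p):
    reduce_scatter = (p - 1) * n / p    # p-1 steps, each sending an n/p chunk
    all_gather = (p - 1) * n / p        # another p-1 steps of n/p each
    return reduce_scatter + all_gather  # = 2 * n * (p - 1) / p

for p in (2, 4, 8, 64, 1024):
    print(p, ring_allreduce_traffic(n=1.0, p=p))  # tends toward 2.0 as p grows
```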

Cited by 300 publications (148 citation statements)
References 26 publications
“…In contrast to PS, All-Reduce replaces the use of central nodes with carefully scheduled global communication to achieve better parallelism. The state-of-the-art solutions [31,41,45] leverage Ring All-Reduce [38], the advanced all-reduce algorithm that effectively utilizes the bandwidth between computation devices. Specifically, workers are organized as a ring, and gradients are divided into chunks and passed over the ring in a parallel manner.…”
Section: Existing Synchronization Approaches (mentioning)
confidence: 99%
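The chunked ring schedule described in this citation statement can be made concrete with a small single-process simulation (my own sketch; the helper name ring_allreduce and the chunk-indexing formulas are illustrative, not code from any of the cited systems). Each of the p simulated workers holds a gradient split into p chunks; the reduce-scatter phase accumulates partial sums around the ring, and the all-gather phase circulates the fully reduced chunks.

```python
# Illustrative single-process simulation of ring all-reduce
# (reduce-scatter + all-gather); not the code of the cited systems.
import numpy as np

def ring_allreduce(chunks_per_worker):
    """chunks_per_worker[r][c] is chunk c held by worker r (NumPy arrays)."""
    p = len(chunks_per_worker)
    data = [list(map(np.array, w)) for w in chunks_per_worker]  # private copies

    # Phase 1: reduce-scatter. After p-1 steps, worker r holds the fully
    # reduced chunk (r + 1) % p.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, data[r][(r - s) % p].copy()) for r in range(p)]
        for r, c, payload in sends:
            data[(r + 1) % p][c] += payload

    # Phase 2: all-gather. Circulate the reduced chunks around the ring.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, data[r][(r + 1 - s) % p].copy()) for r in range(p)]
        for r, c, payload in sends:
            data[(r + 1) % p][c] = payload

    return data

# Example: 4 workers, a gradient of 8 elements split into 4 chunks of 2.
p = 4
grads = [np.arange(8, dtype=float) + r for r in range(p)]
chunks = [np.split(g, p) for g in grads]
out = ring_allreduce(chunks)
expected = np.split(sum(grads), p)
assert all(np.allclose(out[r][c], expected[c]) for r in range(p) for c in range(p))
```

Because every step moves only an n/p-sized chunk between ring neighbors, all links are kept busy in parallel, which is the property the quoted statement highlights.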
“…In contrast, homogeneous models enforced by S-PSGD cannot converge with a large batch size and aggressive learning rate for our ASR task setting [4]. A good allreduce implementation can finish each round of communication after effectively 2 messages are sent across the communication network, independent of the number of learners [6]. We choose the Nvidia NCCL [7] as our allreduce implementation.…”
Section: Design and Implementation (mentioning)
confidence: 99%
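The "effectively 2 messages" remark can be read through the usual latency-bandwidth cost model. A small sketch under my own assumptions (alpha = per-message startup latency, beta = per-byte transfer time; neither symbol appears in the quoted text): the bandwidth term of a ring all-reduce is 2n(p-1)/p * beta, i.e. each learner sends the gradient roughly twice, independent of the number of learners.

```python
# Illustrative latency-bandwidth model (my assumptions, not from the cited
# papers): time for a ring all-reduce of an n-byte gradient over p learners.
def ring_allreduce_time(n, p, alpha, beta):
    latency = 2 * (p - 1) * alpha           # one message per step, 2(p-1) steps
    bandwidth = 2 * (p - 1) / p * n * beta  # ~2n bytes per learner, for any p
    return latency + bandwidth

# The bandwidth term barely changes with p: each learner sends the gradient
# "effectively twice", as the quoted statement puts it.
for p in (2, 8, 128):
    print(p, ring_allreduce_time(n=100_000_000, p=p, alpha=5e-6, beta=1e-9))
```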
“…We replace the original gRPC implementation with Message Passing Interface (MPI) and NVIDIA Collective Communications Library (NCCL) [8]. NCCL provides a highly optimized version of routines, such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, and the integrated bandwidth-optimal ring all-reduce algorithm [33], to achieve high bandwidth over PCIe on NVIDIA GPU. In order to scale from one GPU to multiple nodes and multiple GPUs, we implement several APIs for communication: 1) a broadcast operation to synchronize parameters among all GPUs at the initialization stage or the recovery from the checkpoint; 2) a distributed optimizer wrapper for synchronization update of parameters; 3) some operations for data partition and barrier, etc.…”
Section: Acceleration By Distributed Training (mentioning)
confidence: 99%
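As a rough illustration of the workflow described above (broadcast of parameters at initialization or checkpoint recovery, then gradient synchronization each step), here is a minimal sketch using PyTorch's torch.distributed with the NCCL backend. The cited work builds on MPI/NCCL directly, so the framework choice, function names, and loop structure below are assumptions for illustration only.

```python
# Hedged sketch of the broadcast-then-all-reduce pattern over the NCCL
# backend via torch.distributed; illustrative, not the cited system's code.
import torch
import torch.distributed as dist

def init_and_broadcast(model, rank, world_size):
    # One process per GPU; the launcher is assumed to set MASTER_ADDR/PORT.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # (1) Broadcast initial (or checkpoint-restored) parameters from rank 0.
    for param in model.parameters():
        dist.broadcast(param.data, src=0)

def allreduce_gradients(model, world_size):
    # (2) After backward(), average gradients across all workers.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

NCCL selects its communication schedule (ring-based, and tree-based in newer releases) internally, so the broadcast and all-reduce calls above stay the same regardless of the underlying topology.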