2019
DOI: 10.48550/arxiv.1910.04940
Preprint

Blink: Fast and Generic Collectives for Distributed ML

Abstract: Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage he…

Cited by 7 publications (7 citation statements)
References 17 publications (25 reference statements)

Citation statements:
“…Table 1 summarizes some recent distributed training solutions by marking which scheme they can support. Besides advances in training schemes, prior work has also explored different communication algorithms, including tree-based AllReduce [22], heterogeneity-aware interconnection structure [39], and AllReduce decomposition [14]. As this paper focuses on DDP, the remainder of this section only elaborates and compares closely related techniques, i.e., Synchronous, Intra-iteration, and Data parallel training schemes.…”
Section: Related Work (mentioning)
confidence: 99%
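
The AllReduce decomposition mentioned in the statement above can be made concrete with a short sketch: an AllReduce expressed as a reduce-scatter (each worker sums one chunk) followed by an all-gather (every worker collects all summed chunks). The sketch below simulates the exchange sequentially in one process with NumPy; the function name and chunking scheme are illustrative assumptions, not the API of any cited library.

import numpy as np

def allreduce_decomposed(worker_grads):
    # Sequential simulation of AllReduce as reduce-scatter + all-gather.
    # worker_grads: list of equal-length 1-D NumPy arrays, one per worker.
    n = len(worker_grads)

    # Split each worker's gradient into n chunks; chunk i is "owned" by worker i.
    chunks = [np.array_split(g, n) for g in worker_grads]

    # Reduce-scatter: worker i accumulates chunk i from every worker.
    reduced = [sum(chunks[w][i] for w in range(n)) for i in range(n)]

    # All-gather: every worker receives every reduced chunk.
    full_sum = np.concatenate(reduced)
    return [full_sum.copy() for _ in range(n)]

# Example: 4 workers with 8-element gradients; all workers end with the same sum.
grads = [np.full(8, float(rank)) for rank in range(4)]
out = allreduce_decomposed(grads)
assert all(np.array_equal(o, out[0]) for o in out)
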
“…Two major challenges of modern large-scale systems are the need for faster collective communication operations [51,67] and topology-aware scheduling [7,72]. Recent works like topology-aware scheduling [7] and Gandiva [72] have motivated the importance of optimal placements to improve performance of Machine Learning (ML) workloads within multi-GPU environments by efficiently utilizing inter-accelerator interconnection links.…”
Section: CPU GPU (mentioning)
confidence: 99%
“…Collective communication: In [33,67], the authors have proposed techniques towards achieving efficient collective communication. Blink [67] offers a new approach to collective communication by creating sets of spanning trees instead of rings. The spanning trees are dynamically generated based on the topology detected to utilize the links best.…”
Section: Related Work (mentioning)
confidence: 99%
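
To illustrate the spanning-tree idea described in this statement, the sketch below packs trees greedily into a capacitated link graph: it repeatedly grows a spanning tree from a root over links with residual bandwidth and charges each tree against that capacity. This is a simplified illustration under assumed names and a greedy heuristic; Blink itself derives its trees from an optimization over the detected topology, so this is not Blink's actual algorithm or API.

from collections import defaultdict

def pack_spanning_trees(nodes, link_bw, root, unit=1.0):
    # Greedy tree packing over a capacitated topology graph (illustration only).
    # link_bw: dict mapping (u, v) node pairs to available bandwidth.
    residual = defaultdict(float)
    for (u, v), bw in link_bw.items():
        residual[frozenset((u, v))] += bw

    trees = []
    while True:
        # Grow one spanning tree from the root over links that still have
        # at least `unit` bandwidth left.
        tree, visited, frontier = [], {root}, [root]
        while frontier:
            u = frontier.pop()
            for v in nodes:
                if v not in visited and residual[frozenset((u, v))] >= unit:
                    tree.append((u, v))
                    visited.add(v)
                    frontier.append(v)
        if len(visited) < len(nodes):
            break  # residual capacity no longer admits a full spanning tree
        for u, v in tree:
            residual[frozenset((u, v))] -= unit
        trees.append(tree)
    return trees

# Hypothetical 4-GPU topology: a ring plus one extra cross link.
gpus = ["gpu0", "gpu1", "gpu2", "gpu3"]
bw = {("gpu0", "gpu1"): 2.0, ("gpu1", "gpu2"): 2.0, ("gpu2", "gpu3"): 2.0,
      ("gpu3", "gpu0"): 2.0, ("gpu0", "gpu2"): 1.0}
print(pack_spanning_trees(gpus, bw, root="gpu0"))

On this hypothetical ring-plus-cross-link topology, the greedy pass finds two unit-bandwidth trees before the residual capacity is exhausted.
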
“…For instance, NVLink and NVSwitch (NVL, 2020) are widely used for intra-machine GPU-to-GPU interconnection and RDMA (Xue et al., 2019; Jiang et al., 2020) for intra-rack communication. Unfortunately, even with this advanced hardware, the performance of large-scale distributed training is still far from near-linear scalability because of the large model sizes (Wang et al., 2019; Zhang et al., 2020). Besides sparsification and quantization algorithms, low-rank compression algorithms (Vogels et al., 2019; Cho et al., 2019; Idelbayev & Carreira-Perpinán, 2020) are also proposed to reduce the communicated data size.…”
Section: Related Work (mentioning)
confidence: 99%
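
As a concrete illustration of the low-rank compression idea in this statement, the sketch below factorizes a 2-D gradient with a truncated SVD and returns the two small factors that would be communicated instead of the full matrix; the receiver multiplies them back to recover an approximation. This is a generic sketch with assumed names, not the specific algorithms cited, which typically avoid a full SVD in favor of cheaper approximations such as power iteration.

import numpy as np

def lowrank_compress(grad, rank):
    # Truncated SVD of a 2-D gradient: send two small factors instead of grad.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    p = u[:, :rank] * s[:rank]   # shape (m, rank)
    q = vt[:rank, :]             # shape (rank, n)
    return p, q

def lowrank_decompress(p, q):
    # Receiver reconstructs an approximation of the original gradient.
    return p @ q

# Example: a 512 x 256 gradient compressed to rank 4.
rng = np.random.default_rng(0)
g = rng.standard_normal((512, 256))
p, q = lowrank_compress(g, rank=4)
approx = lowrank_decompress(p, q)
print(p.size + q.size, "values sent instead of", g.size)
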