2019
DOI: 10.48550/arxiv.1910.04940
Preprint

Blink: Fast and Generic Collectives for Distributed ML

Abstract: Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage he…

Cited by 7 publications (7 citation statements)
References 17 publications (25 reference statements)

Citation statements:
“…Table 1 summarizes some recent distributed training solutions by marking which scheme they can support. Besides advances in training schemes, prior work has also explored different communication algorithms, including tree-based AllReduce [22], heterogeneity-aware interconnection structure [39], and AllReduce decomposition [14]. As this paper focuses on DDP, the remainder of this section only elaborates and compares closely related techniques, i.e., Synchronous, Intra-iteration, and Data parallel training schemes.…”
Section: Related Work (mentioning)
confidence: 99%
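
The AllReduce decomposition mentioned in the statement above can be made concrete with a short sketch: an AllReduce expressed as a reduce-scatter (each worker sums one chunk) followed by an all-gather (every worker collects all summed chunks). The sketch below simulates the exchange sequentially in one process with NumPy; the function name and chunking scheme are illustrative assumptions, not the API of any cited library.

import numpy as np

def allreduce_decomposed(worker_grads):
    # Sequential simulation of AllReduce as reduce-scatter + all-gather.
    # worker_grads: list of equal-length 1-D NumPy arrays, one per worker.
    n = len(worker_grads)

    # Split each worker's gradient into n chunks; chunk i is "owned" by worker i.
    chunks = [np.array_split(g, n) for g in worker_grads]

    # Reduce-scatter: worker i accumulates chunk i from every worker.
    reduced = [sum(chunks[w][i] for w in range(n)) for i in range(n)]

    # All-gather: every worker receives every reduced chunk.
    full_sum = np.concatenate(reduced)
    return [full_sum.copy() for _ in range(n)]

# Example: 4 workers with 8-element gradients; all workers end with the same sum.
grads = [np.full(8, float(rank)) for rank in range(4)]
out = allreduce_decomposed(grads)
assert all(np.array_equal(o, out[0]) for o in out)
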
“…Two major challenges of modern large-scale systems are the need for faster collective communication operations [51,67] and topology-aware scheduling [7,72]. Recent works like topology-aware scheduling [7] and Gandiva [72] have motivated the importance of optimal placements to improve performance of Machine Learning (ML) workloads within multi-GPU environments by efficiently utilizing inter-accelerator interconnection links.…”
Section: CPU GPU (mentioning)
confidence: 99%
“…Collective communication: In [33,67], the authors have proposed techniques towards achieving efficient collective communication. Blink [67] offers a new approach to collective communication by creating sets of spanning trees instead of rings. The spanning trees are dynamically generated based on the topology detected to utilize the links best.…”
Section: Related Work (mentioning)
confidence: 99%
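
To illustrate the spanning-tree idea described in this statement, the sketch below packs trees greedily into a capacitated link graph: it repeatedly grows a spanning tree from a root over links with residual bandwidth and charges each tree against that capacity. This is a simplified illustration under assumed names and a greedy heuristic; Blink itself derives its trees from an optimization over the detected topology, so this is not Blink's actual algorithm or API.

from collections import defaultdict

def pack_spanning_trees(nodes, link_bw, root, unit=1.0):
    # Greedy tree packing over a capacitated topology graph (illustration only).
    # link_bw: dict mapping (u, v) node pairs to available bandwidth.
    residual = defaultdict(float)
    for (u, v), bw in link_bw.items():
        residual[frozenset((u, v))] += bw

    trees = []
    while True:
        # Grow one spanning tree from the root over links that still have
        # at least `unit` bandwidth left.
        tree, visited, frontier = [], {root}, [root]
        while frontier:
            u = frontier.pop()
            for v in nodes:
                if v not in visited and residual[frozenset((u, v))] >= unit:
                    tree.append((u, v))
                    visited.add(v)
                    frontier.append(v)
        if len(visited) < len(nodes):
            break  # residual capacity no longer admits a full spanning tree
        for u, v in tree:
            residual[frozenset((u, v))] -= unit
        trees.append(tree)
    return trees

# Hypothetical 4-GPU topology: a ring plus one extra cross link.
gpus = ["gpu0", "gpu1", "gpu2", "gpu3"]
bw = {("gpu0", "gpu1"): 2.0, ("gpu1", "gpu2"): 2.0, ("gpu2", "gpu3"): 2.0,
      ("gpu3", "gpu0"): 2.0, ("gpu0", "gpu2"): 1.0}
print(pack_spanning_trees(gpus, bw, root="gpu0"))

On this hypothetical ring-plus-cross-link topology, the greedy pass finds two unit-bandwidth trees before the residual capacity is exhausted.
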
“…For instance, NVLink and NVSwitch (NVL, 2020) are widely used for intra-machine GPU-to-GPU interconnection and RDMA (Xue et al., 2019; Jiang et al., 2020) for intra-rack communication. Unfortunately, even with this advanced hardware, the performance of large-scale distributed training is still far from near-linear scalability because of the large model sizes (Wang et al., 2019; Zhang et al., 2020). Besides sparsification and quantization algorithms, low-rank compression algorithms (Vogels et al., 2019; Cho et al., 2019; Idelbayev & Carreira-Perpinán, 2020) are also proposed to reduce the communicated data size.…”
Section: Related Work (mentioning)
confidence: 99%
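
As a concrete illustration of the low-rank compression idea in this statement, the sketch below factorizes a 2-D gradient with a truncated SVD and returns the two small factors that would be communicated instead of the full matrix; the receiver multiplies them back to recover an approximation. This is a generic sketch with assumed names, not the specific algorithms cited, which typically avoid a full SVD in favor of cheaper approximations such as power iteration.

import numpy as np

def lowrank_compress(grad, rank):
    # Truncated SVD of a 2-D gradient: send two small factors instead of grad.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    p = u[:, :rank] * s[:rank]   # shape (m, rank)
    q = vt[:rank, :]             # shape (rank, n)
    return p, q

def lowrank_decompress(p, q):
    # Receiver reconstructs an approximation of the original gradient.
    return p @ q

# Example: a 512 x 256 gradient compressed to rank 4.
rng = np.random.default_rng(0)
g = rng.standard_normal((512, 256))
p, q = lowrank_compress(g, rank=4)
approx = lowrank_decompress(p, q)
print(p.size + q.size, "values sent instead of", g.size)
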