2021
DOI: 10.48550/arxiv.2107.01499
Preprint

BAGUA: Scaling up Distributed Learning with System Relaxations

Abstract: Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via "system relaxations": quantization, decentralization, and communication delay. However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) base…
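The "system relaxations" named in the abstract trade communication volume or synchrony for throughput. As a rough illustration of the quantization relaxation, the sketch below compresses a gradient to 8-bit integers before it would be communicated and decompresses it afterwards; this is a generic illustration, not BAGUA's actual compression scheme, and the helper names are hypothetical.

```python
# Minimal sketch of the "quantization" relaxation: compress a gradient to
# 8-bit integers before communication and decompress afterwards.
# Generic illustration only -- NOT BAGUA's actual compression scheme;
# quantize_8bit / dequantize_8bit are hypothetical helper names.
import numpy as np

def quantize_8bit(grad):
    """Map float gradients to uint8 plus the (offset, scale) needed to invert."""
    g_min, g_max = float(grad.min()), float(grad.max())
    scale = (g_max - g_min) / 255.0
    if scale == 0.0:          # constant gradient: avoid division by zero
        scale = 1.0
    q = np.round((grad - g_min) / scale).astype(np.uint8)
    return q, g_min, scale

def dequantize_8bit(q, g_min, scale):
    return q.astype(np.float32) * scale + g_min

grad = np.random.randn(1024).astype(np.float32)
q, g_min, scale = quantize_8bit(grad)
recovered = dequantize_8bit(q, g_min, scale)
# Payload shrinks from 4 bytes to 1 byte per element; error is bounded by ~scale/2.
print(q.nbytes, grad.nbytes, float(np.abs(recovered - grad).max()))
```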

Cited by 4 publications (7 citation statements)
References 54 publications
“…Data parallelism (DP). The most common way to accelerate model training is DP [14,23,40], where the data is partitioned across workers while each worker holds a model replica and performs collective primitives such as AllReduce [32] at a certain interval to keep the replicas synchronized. However, when the model is too large, a single GPU's memory cannot hold the entire model.…”
Section: Background and Related Work
confidence: 99%
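To make the data-parallel pattern in this quote concrete, the sketch below performs one SGD step with a hand-written gradient AllReduce. It assumes torch.distributed has already been initialized (e.g. via torchrun) and that every worker starts from an identical model replica; it is a minimal illustration, not the implementation of any system cited here.

```python
# Minimal data-parallel SGD step with an explicit gradient AllReduce.
# Assumes torch.distributed is already initialized and all replicas start equal.
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, lr=0.1):
    inputs, targets = batch                      # this worker's shard of the data
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients across workers
        p.grad.div_(world_size)                         # ...and average them
    with torch.no_grad():
        for p in model.parameters():
            p.add_(p.grad, alpha=-lr)            # identical update on every replica
            p.grad.zero_()
    return loss.item()
```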
“…Another case worth noting is with GBS 64 and 128 on 64 GPUs, where Merak only achieves an acceleration of 19.4%-21.9%. This is because large DP degrees (8-16) and a small GBS result in a small number of microbatches, so DP communication and model updates occupy a considerable portion of the runtime. Merak performs well in other situations, with up to 41.7% performance gains.…”
Section: End-to-end Training Performance
confidence: 99%
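A small worked calculation makes the quoted bottleneck concrete; the microbatch size below is an assumed value for illustration, not a number reported in the cited experiment.

```python
# Illustrative arithmetic for the quoted case: GBS 64 with a DP degree of 16.
global_batch_size = 64      # GBS from the quoted experiment
dp_degree = 16              # upper end of the quoted DP-degree range (8-16)
micro_batch_size = 4        # assumed value, for illustration only

samples_per_replica = global_batch_size // dp_degree        # 4 samples
num_microbatches = samples_per_replica // micro_batch_size  # 1 microbatch
print(samples_per_replica, num_microbatches)
# With a single microbatch per step there is almost no computation left to
# overlap with DP communication and model updates, so they dominate runtime.
```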
“…The optimization of the AllReduce communication paradigm among NN workers in Persia is key to hiding communication overhead within the backward computation of the neural network. This functionality is implemented on top of Bagua [29], an open-source general-purpose distributed learning system optimized for data parallelism, also released by Kwai. Currently, Persia uses Bagua's centralized synchronous full-precision communication primitive (equivalent to AllReduce) by default, in an attempt to preserve accuracy.…”
Section: Communication Optimization
confidence: 99%
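The "hiding communication within backward" idea in this quote can be sketched with per-parameter gradient hooks that launch asynchronous AllReduces while the rest of the backward pass is still running. This is a hand-rolled illustration, not Persia's or Bagua's code; it assumes an initialized torch.distributed process group and PyTorch 2.1+ for register_post_accumulate_grad_hook.

```python
# Sketch of overlapping gradient communication with backward computation:
# each parameter launches an async AllReduce as soon as its gradient is ready,
# and all handles are awaited before the optimizer step.
# Assumes torch.distributed is initialized and PyTorch >= 2.1; illustration only.
import torch
import torch.distributed as dist

pending = []  # outstanding (async) AllReduce handles

def attach_overlapped_allreduce(model):
    world_size = dist.get_world_size()
    for p in model.parameters():
        if not p.requires_grad:
            continue
        def hook(param, world_size=world_size):
            param.grad.div_(world_size)   # pre-divide so the summed result is an average
            pending.append(dist.all_reduce(param.grad, async_op=True))
        p.register_post_accumulate_grad_hook(hook)

def backward_with_overlap(loss):
    pending.clear()
    loss.backward()              # AllReduces are issued while backward still runs
    for handle in pending:
        handle.wait()            # ensure every gradient is fully averaged
```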
“…Popular options include TensorFlow [11], PyTorch [8], MXNet [22], PaddlePaddle [9], MindSpore [6], etc. Extensions and modifications have been built on these general-purpose learning systems for efficient distributed learning (e.g., Horovod [73], BytePS [41], Bagua [29], Megatron [75], ZeRO [69], SageMaker [42], etc.). However, even including these extensions, current general-purpose deep learning systems do not consider the challenges of handling heterogeneity over a hybrid infrastructure.…”
Section: Distributed Deep Learning
confidence: 99%
“…BAGUA [69] is a recent open-source library that supports both global and partial averaging, offers full- and low-precision operations, and focuses on efficient deep learning. It does not support asynchronous communication, diverse and time-varying network topologies, or directed communication in pull and push styles, which are supported by BlueFog to implement algorithms such as push-sum [3] and push-pull [70], [71], as well as more recent decentralized algorithms using those features.…”
Section: B. Related Work
confidence: 99%
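As an illustration of the directed, push-style primitives the quote attributes to BlueFog, the sketch below simulates push-sum averaging on a directed ring in plain NumPy: every node keeps a value and a weight, pushes equal shares to its out-neighbors, and estimates the global average as value/weight. This is a toy simulation of the algorithm, not BlueFog's or BAGUA's API.

```python
# Toy NumPy simulation of push-sum averaging on a directed ring.
# Each node i pushes half of (x_i, w_i) to itself and half to node i+1;
# the ratio x_i / w_i converges to the average of the initial values.
# Illustration of the push-style primitive discussed above, not a library API.
import numpy as np

def push_sum_ring(values, num_rounds=50):
    n = len(values)
    x = np.asarray(values, dtype=np.float64)
    w = np.ones(n)
    for _ in range(num_rounds):
        x_new, w_new = np.zeros(n), np.zeros(n)
        for i in range(n):
            for j in (i, (i + 1) % n):   # out-neighbors: self and the next node
                x_new[j] += x[i] / 2.0
                w_new[j] += w[i] / 2.0
        x, w = x_new, w_new
    return x / w                          # each entry approaches mean(values)

print(push_sum_ring([1.0, 2.0, 3.0, 4.0]))   # ~[2.5, 2.5, 2.5, 2.5]
```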