2021
DOI: 10.48550/arxiv.2107.01499
Preprint

BAGUA: Scaling up Distributed Learning with System Relaxations

Abstract: Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via "system relaxations": quantization, decentralization, and communication delay. However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) base…
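The "system relaxations" named in the abstract trade communication volume or synchrony for throughput. As a rough illustration of the quantization relaxation, the sketch below compresses a gradient to 8-bit integers before it would be communicated and decompresses it afterwards; this is a generic illustration, not BAGUA's actual compression scheme, and the helper names are hypothetical.

```python
# Minimal sketch of the "quantization" relaxation: compress a gradient to
# 8-bit integers before communication and decompress afterwards.
# Generic illustration only -- NOT BAGUA's actual compression scheme;
# quantize_8bit / dequantize_8bit are hypothetical helper names.
import numpy as np

def quantize_8bit(grad):
    """Map float gradients to uint8 plus the (offset, scale) needed to invert."""
    g_min, g_max = float(grad.min()), float(grad.max())
    scale = (g_max - g_min) / 255.0
    if scale == 0.0:          # constant gradient: avoid division by zero
        scale = 1.0
    q = np.round((grad - g_min) / scale).astype(np.uint8)
    return q, g_min, scale

def dequantize_8bit(q, g_min, scale):
    return q.astype(np.float32) * scale + g_min

grad = np.random.randn(1024).astype(np.float32)
q, g_min, scale = quantize_8bit(grad)
recovered = dequantize_8bit(q, g_min, scale)
# Payload shrinks from 4 bytes to 1 byte per element; error is bounded by ~scale/2.
print(q.nbytes, grad.nbytes, float(np.abs(recovered - grad).max()))
```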

Cited by 4 publications (7 citation statements)
References 54 publications
“…Data parallelism (DP). The most common way to accelerate model training is DP [14,23,40], where the data is partitioned across workers while each worker holds a model replica and performs collective primitives such as AllReduce [32] at a certain interval to keep the replicas synchronized. However, when the model is too large, a single GPU's memory cannot hold the entire model.…”
Section: Background and Related Work
confidence: 99%
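To make the data-parallel pattern in this quote concrete, the sketch below performs one SGD step with a hand-written gradient AllReduce. It assumes torch.distributed has already been initialized (e.g. via torchrun) and that every worker starts from an identical model replica; it is a minimal illustration, not the implementation of any system cited here.

```python
# Minimal data-parallel SGD step with an explicit gradient AllReduce.
# Assumes torch.distributed is already initialized and all replicas start equal.
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, lr=0.1):
    inputs, targets = batch                      # this worker's shard of the data
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients across workers
        p.grad.div_(world_size)                         # ...and average them
    with torch.no_grad():
        for p in model.parameters():
            p.add_(p.grad, alpha=-lr)            # identical update on every replica
            p.grad.zero_()
    return loss.item()
```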
“…Another case worth noting is with GBS 64 and 128 on 64 GPUs, where Merak only achieves an acceleration of 19.4%-21.9%. This is because large DP degrees (8-16) and a small GBS result in a small number of microbatches, so DP communication and model updates occupy a considerable portion of the runtime. Merak performs well in other situations, with up to 41.7% performance gains.…”
Section: End-to-end Training Performance
confidence: 99%
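A small worked calculation makes the quoted bottleneck concrete; the microbatch size below is an assumed value for illustration, not a number reported in the cited experiment.

```python
# Illustrative arithmetic for the quoted case: GBS 64 with a DP degree of 16.
global_batch_size = 64      # GBS from the quoted experiment
dp_degree = 16              # upper end of the quoted DP-degree range (8-16)
micro_batch_size = 4        # assumed value, for illustration only

samples_per_replica = global_batch_size // dp_degree        # 4 samples
num_microbatches = samples_per_replica // micro_batch_size  # 1 microbatch
print(samples_per_replica, num_microbatches)
# With a single microbatch per step there is almost no computation left to
# overlap with DP communication and model updates, so they dominate runtime.
```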
“…The optimization of the AllReduce communication paradigm among NN workers in Persia is key to hiding communication overhead within the backward computation of the neural network. This functionality is implemented on top of Bagua [29], an open-source general-purpose distributed learning system optimized for data parallelism, also released by Kwai. Currently, Persia uses Bagua's centralized synchronous full-precision communication primitive (equivalent to AllReduce) by default, in an attempt to preserve accuracy.…”
Section: Communication Optimization
confidence: 99%
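The "hiding communication within backward" idea in this quote can be sketched with per-parameter gradient hooks that launch asynchronous AllReduces while the rest of the backward pass is still running. This is a hand-rolled illustration, not Persia's or Bagua's code; it assumes an initialized torch.distributed process group and PyTorch 2.1+ for register_post_accumulate_grad_hook.

```python
# Sketch of overlapping gradient communication with backward computation:
# each parameter launches an async AllReduce as soon as its gradient is ready,
# and all handles are awaited before the optimizer step.
# Assumes torch.distributed is initialized and PyTorch >= 2.1; illustration only.
import torch
import torch.distributed as dist

pending = []  # outstanding (async) AllReduce handles

def attach_overlapped_allreduce(model):
    world_size = dist.get_world_size()
    for p in model.parameters():
        if not p.requires_grad:
            continue
        def hook(param, world_size=world_size):
            param.grad.div_(world_size)   # pre-divide so the summed result is an average
            pending.append(dist.all_reduce(param.grad, async_op=True))
        p.register_post_accumulate_grad_hook(hook)

def backward_with_overlap(loss):
    pending.clear()
    loss.backward()              # AllReduces are issued while backward still runs
    for handle in pending:
        handle.wait()            # ensure every gradient is fully averaged
```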
“…Popular options include TensorFlow [11], PyTorch [8], MXNet [22], PaddlePaddle [9], MindSpore [6], etc. Extensions and modifications have been built on these general-purpose learning systems for efficient distributed learning (e.g., Horovod [73], BytePS [41], Bagua [29], Megatron [75], ZeRO [69], SageMaker [42], etc.). However, even including these extensions, current general-purpose deep learning systems do not consider the challenges of handling heterogeneity over a hybrid infrastructure.…”
Section: Distributed Deep Learning
confidence: 99%
“…BAGUA [69] is a recent open-source library that supports both global and partial averaging, offers full- and low-precision operations, and focuses on efficient deep learning. It does not support asynchronous communication, diverse and time-varying network topologies, or directed communication in pull and push styles, which are supported by BlueFog to implement algorithms such as push-sum [3] and push-pull [70], [71], as well as more recent decentralized algorithms using those features.…”
Section: B. Related Work
confidence: 99%
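As an illustration of the directed, push-style primitives the quote attributes to BlueFog, the sketch below simulates push-sum averaging on a directed ring in plain NumPy: every node keeps a value and a weight, pushes equal shares to its out-neighbors, and estimates the global average as value/weight. This is a toy simulation of the algorithm, not BlueFog's or BAGUA's API.

```python
# Toy NumPy simulation of push-sum averaging on a directed ring.
# Each node i pushes half of (x_i, w_i) to itself and half to node i+1;
# the ratio x_i / w_i converges to the average of the initial values.
# Illustration of the push-style primitive discussed above, not a library API.
import numpy as np

def push_sum_ring(values, num_rounds=50):
    n = len(values)
    x = np.asarray(values, dtype=np.float64)
    w = np.ones(n)
    for _ in range(num_rounds):
        x_new, w_new = np.zeros(n), np.zeros(n)
        for i in range(n):
            for j in (i, (i + 1) % n):   # out-neighbors: self and the next node
                x_new[j] += x[i] / 2.0
                w_new[j] += w[i] / 2.0
        x, w = x_new, w_new
    return x / w                          # each entry approaches mean(values)

print(push_sum_ring([1.0, 2.0, 3.0, 4.0]))   # ~[2.5, 2.5, 2.5, 2.5]
```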