2021
DOI: 10.48550/arxiv.2102.02888
Preprint

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Abstract: Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence …
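For intuition, here is a minimal sketch of the error-compensated (error-feedback) 1-bit compression idea the abstract refers to, written in NumPy. The function name and the sign-plus-mean-magnitude quantizer are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def one_bit_compress_with_error_feedback(update, error_buffer):
    """Compress a tensor to one bit per element (sign plus a shared scale),
    carrying the quantization error forward via an error-feedback buffer."""
    corrected = update + error_buffer          # fold in the residual from the last step
    scale = np.mean(np.abs(corrected))         # single magnitude shared by all elements
    compressed = scale * np.sign(corrected)    # what would actually be communicated
    new_error = corrected - compressed         # remember what the quantizer lost
    return compressed, new_error

# Toy usage: the error buffer keeps the accumulated compressed updates
# close to the accumulated true updates over time.
rng = np.random.default_rng(0)
error = np.zeros(4)
for step in range(3):
    grad = rng.normal(size=4)
    message, error = one_bit_compress_with_error_feedback(grad, error)
    print(step, message.round(3), error.round(3))
```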


Cited by 5 publications (20 citation statements). References 5 publications.
“…Bernstein et al (2018b); Sohn et al (2019); Le Phong & Phuong (2020); Lyu (2021) investigate the robustness of 1-bit SGD. Perhaps the closest works to this paper are (Tang et al, 2021; Li et al, 2021), which propose using two-stage training to enable 1-bit Adam and 1-bit Lamb, respectively. Among all the variants of 1-bit communication, the design with an error feedback mechanism has been shown to work best both empirically (Seide et al, 2014) and theoretically (Karimireddy et al, 2019).…”
Section: Related Work
confidence: 99%
“…Limitations of the state-of-the-art 1-bit Adam. Tang et al (2021) undertook the first investigation of this question and proposed 1-bit Adam. The algorithm follows a two-stage training paradigm: first run Adam with full-precision communication (full-precision stage 1); and then switch to 1 bit when the variance state, i.e.…”
Section: Introduction
confidence: 99%
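As a rough illustration of the two-stage paradigm described in the statement above, the sketch below runs plain Adam during a warmup stage, then freezes the variance state and compresses only the momentum to 1 bit with error feedback. It is a single-worker simplification with hypothetical names; in the actual algorithm the compressed momentum is averaged across workers rather than used locally.

```python
import numpy as np

def one_bit_adam_step(param, grad, m, v, error, step, warmup_steps,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified single-worker sketch of a two-stage 1-bit Adam step."""
    m = beta1 * m + (1 - beta1) * grad
    if step < warmup_steps:
        # Stage 1 (warmup): ordinary Adam with full-precision "communication";
        # the variance state is still being updated here.
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_comm = m
    else:
        # Stage 2: v stays frozen at its warmup value; the momentum is
        # compressed to 1 bit (sign + scale) with error feedback before it
        # would be exchanged between workers.
        corrected = m + error
        scale = np.mean(np.abs(corrected))
        m_comm = scale * np.sign(corrected)
        error = corrected - m_comm
        m = m_comm  # in the multi-worker algorithm this would be the averaged value
    param = param - lr * m_comm / (np.sqrt(v) + eps)
    return param, m, v, error
```

A driver loop would carry (param, m, v, error) across iterations and set warmup_steps to the length of the full-precision stage; bias correction is omitted to keep the sketch short.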
“…These two design decisions enable the flexibility and efficiency of BAGUA: to implement a new advanced algorithm with system relaxation (e.g., 1-bit Adam [79] or Decentralized SGD [15]) in BAGUA, a developer does not need to worry about manually balancing communications with computations; instead, she can specify, at a high level, the logical semantics and BAGUA will automatically optimize its execution. In this section, we first provide a high-level system overview, followed by a description of these primitives and their implementations, and then the simple, but effective, optimization framework in BAGUA.…”
Section: (Optimizations) How Should One Optimize the End-to-end Execu...
confidence: 99%
“…QSGD [4], a quantized (8-bit) DP-SG algorithm, implemented with the C LP S primitive without error compensation. 1-bit Adam [79], a quantized (1-bit) distributed learning algorithm, implemented with the C LP S primitive with error compensation. Decen-32bits, a decentralized training algorithm with the random probing method to exchange the model parameters in each iteration, implemented with D FP S. Decen-8bits [17], a ring-based decentralized training algorithm with quantization, implemented with D LP S. Async, asynchronous centralized DP-SG.…”
Section: Bagua Algorithms
confidence: 99%
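For contrast with the error-compensated 1-bit scheme sketched earlier, the snippet below sketches QSGD-style stochastic uniform quantization (here to 8 bits), which is unbiased in expectation and is therefore used without error compensation in the statement above. It is a generic illustration and does not use BAGUA's actual primitives or APIs; all names are ours.

```python
import numpy as np

def stochastic_uniform_quantize(x, bits=8, rng=None):
    """QSGD-style quantizer: keep the vector norm and map each entry to one of
    2**bits - 1 uniform levels using unbiased stochastic rounding."""
    rng = rng if rng is not None else np.random.default_rng()
    levels = 2 ** bits - 1
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * levels              # position in [0, levels]
    lower = np.floor(scaled)
    round_up = rng.random(x.shape) < (scaled - lower)
    quantized = lower + round_up                    # integer level per entry
    return np.sign(x) * quantized / levels * norm

# Expectation equals the input, so no error-feedback buffer is needed.
x = np.array([0.3, -1.2, 0.05, 2.0])
print(stochastic_uniform_quantize(x))
```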