2021
DOI: 10.48550/arxiv.2102.02888
Preprint

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Abstract: Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence …
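For intuition, here is a minimal sketch of the error-compensated (error-feedback) 1-bit compression idea the abstract refers to, written in NumPy. The function name and the sign-plus-mean-magnitude quantizer are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def one_bit_compress_with_error_feedback(update, error_buffer):
    """Compress a tensor to one bit per element (sign plus a shared scale),
    carrying the quantization error forward via an error-feedback buffer."""
    corrected = update + error_buffer          # fold in the residual from the last step
    scale = np.mean(np.abs(corrected))         # single magnitude shared by all elements
    compressed = scale * np.sign(corrected)    # what would actually be communicated
    new_error = corrected - compressed         # remember what the quantizer lost
    return compressed, new_error

# Toy usage: the error buffer keeps the accumulated compressed updates
# close to the accumulated true updates over time.
rng = np.random.default_rng(0)
error = np.zeros(4)
for step in range(3):
    grad = rng.normal(size=4)
    message, error = one_bit_compress_with_error_feedback(grad, error)
    print(step, message.round(3), error.round(3))
```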


Cited by 5 publications (20 citation statements). References 5 publications.
“…Bernstein et al (2018b); Sohn et al (2019); Le Phong & Phuong (2020); Lyu (2021) investigate the robustness of 1-bit SGD. Perhaps the closest works to this paper are (Tang et al, 2021; Li et al, 2021), which propose using two-stage training to enable 1-bit Adam and 1-bit Lamb, respectively. Among all the variants of 1-bit communication, the design with an error feedback mechanism has been shown to work best both empirically (Seide et al, 2014) and theoretically (Karimireddy et al, 2019).…”
Section: Related Work
confidence: 99%
“…Limitations of the state-of-the-art 1-bit Adam. Tang et al (2021) undertook the first investigation of this question and proposed 1-bit Adam. The algorithm follows a two-stage training paradigm: first run Adam with full-precision communication (full-precision stage 1); and then switch to 1 bit when the variance state, i.e.…”
Section: Introduction
confidence: 99%
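As a rough illustration of the two-stage paradigm described in the statement above, the sketch below runs plain Adam during a warmup stage, then freezes the variance state and compresses only the momentum to 1 bit with error feedback. It is a single-worker simplification with hypothetical names; in the actual algorithm the compressed momentum is averaged across workers rather than used locally.

```python
import numpy as np

def one_bit_adam_step(param, grad, m, v, error, step, warmup_steps,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simplified single-worker sketch of a two-stage 1-bit Adam step."""
    m = beta1 * m + (1 - beta1) * grad
    if step < warmup_steps:
        # Stage 1 (warmup): ordinary Adam with full-precision "communication";
        # the variance state is still being updated here.
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_comm = m
    else:
        # Stage 2: v stays frozen at its warmup value; the momentum is
        # compressed to 1 bit (sign + scale) with error feedback before it
        # would be exchanged between workers.
        corrected = m + error
        scale = np.mean(np.abs(corrected))
        m_comm = scale * np.sign(corrected)
        error = corrected - m_comm
        m = m_comm  # in the multi-worker algorithm this would be the averaged value
    param = param - lr * m_comm / (np.sqrt(v) + eps)
    return param, m, v, error
```

A driver loop would carry (param, m, v, error) across iterations and set warmup_steps to the length of the full-precision stage; bias correction is omitted to keep the sketch short.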
“…These two design decisions enable the flexibility and efficiency of BAGUA: to implement a new advanced algorithm with system relaxation (e.g., 1-bit Adam [79] or Decentralized SGD [15]) in BAGUA, a developer does not need to worry about manually balancing communications with computations; instead, she can specify, at a high level, the logical semantics and BAGUA will automatically optimize its execution. In this section, we first provide a high-level system overview, followed by a description of these primitives and their implementations, and then the simple, but effective, optimization framework in BAGUA.…”
Section: (Optimizations) How Should One Optimize the End-to-end Execu...
confidence: 99%
“…QSGD [4], a quantized (8-bit) DP-SG algorithm, implemented with the C LP S primitive without error compensation. 1-bit Adam [79], a quantized (1-bit) distributed learning algorithm, implemented with the C LP S primitive with error compensation. Decen-32bits, a decentralized training algorithm with the random probing method to exchange the model parameters in each iteration, implemented with D FP S. Decen-8bits [17], a ring-based decentralized training algorithm with quantization, implemented with D LP S. Async, asynchronous centralized DP-SG.…”
Section: Bagua Algorithms
confidence: 99%
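For contrast with the error-compensated 1-bit scheme sketched earlier, the snippet below sketches QSGD-style stochastic uniform quantization (here to 8 bits), which is unbiased in expectation and is therefore used without error compensation in the statement above. It is a generic illustration and does not use BAGUA's actual primitives or APIs; all names are ours.

```python
import numpy as np

def stochastic_uniform_quantize(x, bits=8, rng=None):
    """QSGD-style quantizer: keep the vector norm and map each entry to one of
    2**bits - 1 uniform levels using unbiased stochastic rounding."""
    rng = rng if rng is not None else np.random.default_rng()
    levels = 2 ** bits - 1
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * levels              # position in [0, levels]
    lower = np.floor(scaled)
    round_up = rng.random(x.shape) < (scaled - lower)
    quantized = lower + round_up                    # integer level per entry
    return np.sign(x) * quantized / levels * norm

# Expectation equals the input, so no error-feedback buffer is needed.
x = np.array([0.3, -1.2, 0.05, 2.0])
print(stochastic_uniform_quantize(x))
```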