2021
DOI: 10.48550/arxiv.2104.06069
Preprint

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

Abstract: To train large models (like BERT and GPT-3) with hundreds or even thousands of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP interconnect networks. On one side, large-batch optimization such as the LAMB algorithm was proposed to reduce the number of communications. On the other side, communication compression algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each communication. However, we find that simply using one of the te…

Cited by 5 publications (6 citation statements)
References 7 publications
“…This momentum is accomplished by introducing two new variables, namely, velocity and friction, as given by Eqs. (3) and (4), respectively [46].…”
Section: SGD With Momentum
confidence: 99%
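The cited equations (3) and (4) are not reproduced on this page. As a hedged sketch of the standard SGD-with-momentum update the quote refers to, where the velocity variable accumulates past gradients and the friction coefficient damps it (the learning rate and coefficient values below are illustrative assumptions):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, friction=0.9):
    # velocity accumulates a decaying sum of past gradients (the "velocity" variable);
    # friction plays the role of the momentum/friction coefficient.
    velocity = friction * velocity - lr * grad
    w = w + velocity
    return w, velocity

# toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, 2.0 * w, v)
print(w)  # close to [0, 0]
```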
“…Then, the resulting output is propagated into the model to lessen the difference. The DL architecture adjusts the weights and repeats the process until convergence is achieved [46,77]. An algorithm is sought that speeds up the learning process while producing the best results.…”
Section: Introduction
confidence: 99%
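As a minimal sketch of the loop this quote describes (compute the error, adjust the weights, repeat until convergence), assuming plain gradient descent and a hypothetical convergence tolerance:

```python
import numpy as np

def train_until_convergence(w, grad_fn, lr=0.1, tol=1e-6, max_iters=10_000):
    # repeat: adjust the weights to lessen the loss, stop when the update is negligible
    for step in range(max_iters):
        update = -lr * grad_fn(w)
        w = w + update
        if np.linalg.norm(update) < tol:  # convergence check (assumed criterion)
            break
    return w, step

# toy example: loss (w - 3)^2 with gradient 2 * (w - 3)
w_final, steps = train_until_convergence(np.array([0.0]), lambda w: 2.0 * (w - 3.0))
print(w_final, steps)  # close to 3.0
```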
“…Bernstein et al (2018b); Sohn et al (2019); Le Phong & Phuong (2020); Lyu (2021) investigate the robustness of 1-bit SGD. Perhaps the closest works to this paper are (Tang et al, 2021; Li et al, 2021), which propose using two-stage training to enable 1-bit Adam and 1-bit LAMB, respectively. Among all the variants of 1-bit communication, the design with an error feedback mechanism has been shown to work best both empirically (Seide et al, 2014) and theoretically (Karimireddy et al, 2019).…”
Section: Related Work
confidence: 99%
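As an illustration of the error feedback mechanism mentioned in this quote (a generic sketch, not the exact 1-bit Adam/LAMB implementation), the residual left over after compressing an update is added back before the next compression, so the error is compensated over time; the function names and hyperparameters are assumptions:

```python
import numpy as np

def one_bit_compress(x):
    # keep only the signs plus one shared magnitude (mean absolute value)
    scale = np.mean(np.abs(x))
    return np.sign(x), scale

def step_with_error_feedback(grad, error, lr=0.01):
    corrected = grad + error                    # re-inject the previous residual
    signs, scale = one_bit_compress(corrected)  # what would actually be communicated
    decompressed = scale * signs                # what the receiver reconstructs
    new_error = corrected - decompressed        # residual carried to the next step
    return -lr * decompressed, new_error

# toy usage over two steps
err = np.zeros(4)
for grad in (np.array([0.5, -1.0, 0.2, 0.1]), np.array([0.4, -0.9, 0.3, 0.0])):
    update, err = step_with_error_feedback(grad, err)
    print(update, err)
```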
“…t since the gradients are usually high-dimensional. Based on the profiling results from (Tang et al, 2021; Li et al, 2021), the communication of gradients could take up to 94% of the total training time on modern clusters. 1-bit compression (Liu et al, 2018) mitigates this problem by sending each gradient with only its signs and a single shared magnitude, usually the average over all the coordinates.…”
Section: 1-Bit Adam and Its Limitations
confidence: 99%
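A sketch of the 1-bit scheme this quote describes, assuming the signs are packed one bit per coordinate and the shared magnitude is the mean absolute value; the function names are illustrative:

```python
import numpy as np

def compress_1bit(grad):
    # one shared magnitude for the whole vector, plus one sign bit per coordinate
    scale = np.float32(np.mean(np.abs(grad)))
    bits = np.packbits(grad >= 0)
    return bits, scale

def decompress_1bit(bits, scale, n):
    signs = np.unpackbits(bits, count=n).astype(np.float32) * 2.0 - 1.0
    return signs * scale

grad = np.random.randn(1_000_000).astype(np.float32)
bits, scale = compress_1bit(grad)
recon = decompress_1bit(bits, scale, grad.size)

# payload is roughly 32x smaller than sending float32 gradients
print(grad.nbytes / bits.nbytes)
```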
“…To improve the training efficiency, more recent works advocate large-batch training [15,23,50,51,55]. However, due to the limited device memory, practitioners have to resort to gradient accumulation, which divides a large batch into multiple micro-batches and accumulates the gradient w.r.t.…”
Section: -Hop Gradient Synchronization
confidence: 99%
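A hedged PyTorch-style sketch of the gradient accumulation pattern this quote describes, with a placeholder model, synthetic data, and an assumed accumulation factor of 4 micro-batches per large batch:

```python
import torch
from torch import nn

# placeholder model, optimizer, and synthetic data for illustration only
model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4                       # micro-batches per large batch (assumed)
large_x = torch.randn(64, 32)         # one "large" batch that won't fit in memory at once
large_y = torch.randn(64, 1)

optimizer.zero_grad()
for micro_x, micro_y in zip(large_x.chunk(accum_steps), large_y.chunk(accum_steps)):
    loss = loss_fn(model(micro_x), micro_y) / accum_steps  # keep the average consistent
    loss.backward()                   # gradients accumulate in the .grad buffers
optimizer.step()                      # a single optimizer step per large batch
optimizer.zero_grad()
```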