2022
DOI: 10.48550/arxiv.2202.06009
Preprint

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

Abstract: 1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD. Its benefits, however, remain an open question on Adam-based model training (e.g. BERT and GPT). In this paper, we propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel designs: (1) adaptive variance state freezing, which eliminates the requirement of running expensive full-precision communication at early stage of training; (2) 1-bit sync, which allows skipping communic…
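The two designs named in the abstract can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of the general idea (freeze Adam's variance state after a warmup threshold, then exchange only 1-bit-compressed momentum with an error-feedback buffer); it is not the paper's implementation, and the class name, the `freeze_step` threshold, and the `allreduce` helper are all assumptions made for illustration.

```python
# Hypothetical sketch of the two ideas described in the abstract:
# (1) freeze Adam's variance (second-moment) state after a warmup phase, and
# (2) afterwards communicate only 1-bit-compressed momentum, keeping the
#     compression residual in an error-feedback buffer.
# All names and thresholds are illustrative, not the authors' code.
import numpy as np

class ToyCompressedAdam:
    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, freeze_step=100):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)          # first moment (momentum)
        self.v = np.zeros(dim)          # second moment (variance)
        self.error = np.zeros(dim)      # error-feedback buffer for compression
        self.freeze_step = freeze_step  # step after which v is no longer updated
        self.t = 0

    def _compress_1bit(self, x):
        """Sign + mean-magnitude compression; the residual feeds the error buffer."""
        scale = np.mean(np.abs(x))
        compressed = scale * np.sign(x)
        self.error = x - compressed
        return compressed

    def step(self, grad, allreduce):
        """`allreduce` is an assumed helper that averages an array across workers."""
        self.t += 1
        if self.t <= self.freeze_step:
            # Warmup: full-precision communication, variance still updating.
            g = allreduce(grad)
            self.m = self.beta1 * self.m + (1 - self.beta1) * g
            self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        else:
            # Variance frozen: only 1-bit-compressed momentum is exchanged.
            self.m = self.beta1 * self.m + (1 - self.beta1) * grad
            self.m = allreduce(self._compress_1bit(self.m + self.error))
        return -self.lr * self.m / (np.sqrt(self.v) + self.eps)
```

On a single process, `allreduce` can simply be `lambda x: x`; in a real data-parallel run it would average the array across workers.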

Cited by 1 publication (1 citation statement)
References 33 publications
“…In addition, the zero redundancy optimizer (ZeRO) facilitates memory efficiency by partitioning model states and gradients. In addition, 1-bit Adam [292], 0/1 Adam [293], and 1-bit LAMB [294] optimizers reduce the communication resource demand in DeepSpeed. Sparse attention kernels support long sequence input and sparse structures with faster execution and comparable performance.…”
Section: Software Framework For Large-scale Distributed Training
Citation type: mentioning; confidence: 99%
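The quoted passage mentions that ZeRO partitions model states across workers instead of replicating them. A minimal sketch of that general idea follows; it is not DeepSpeed's actual implementation, and the function name and contiguous sharding scheme are assumptions.

```python
# Hypothetical sketch of ZeRO-style optimizer-state partitioning: each worker
# allocates the Adam states (m, v) only for its own shard of the parameters,
# rather than keeping a full replica. Names are illustrative only.
import numpy as np

def shard_bounds(num_params, world_size, rank):
    """Contiguous parameter range owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size
    start = rank * per_rank
    return start, min(start + per_rank, num_params)

num_params, world_size = 10, 4
for rank in range(world_size):
    lo, hi = shard_bounds(num_params, world_size, rank)
    # Only the owned shard gets optimizer state allocated on this worker.
    m_shard = np.zeros(hi - lo)
    v_shard = np.zeros(hi - lo)
    print(f"rank {rank}: owns params [{lo}, {hi}), state size {hi - lo}")
```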