2022
DOI: 10.48550/arxiv.2201.11990
Preprint

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Abstract: Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and finetuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest mono…
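
The abstract credits the combination of DeepSpeed and Megatron with making training at this scale feasible. As a rough, hedged illustration of how the two are typically combined for 3D parallelism (data, tensor, and pipeline), the Python sketch below shows a generic DeepSpeed setup; the model, parallel degrees, and config values are illustrative placeholders, not the settings reported for MT-NLG 530B.

import torch
import deepspeed

# Illustrative DeepSpeed config (placeholder values, not the MT-NLG settings).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},  # shard optimizer states across data-parallel ranks
}

# Stand-in for a Megatron-style transformer stack; the real model is built by
# Megatron-LM, with tensor/pipeline parallelism set via launcher flags such as
# --tensor-model-parallel-size and --pipeline-model-parallel-size.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# deepspeed.initialize wraps the model in a distributed training engine that
# handles data parallelism, ZeRO sharding, and mixed precision.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)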

Cited by 123 publications (167 citation statements)
References 30 publications (67 reference statements)
“…About one year after GPT-3 was announced, a spike in similar model announcements followed. These models were developed by both large and small private organizations from around the world: Jurassic-1-Jumbo [46], AI21 Labs, Israel; Ernie 3.0 Titan [70], Baidu, China; Gopher [56], DeepMind, USA/UK; FLAN [71] & LaMDA [68], Google, USA; Pan Gu [78], Huawei, China; Yuan 1.0 [76], Inspur, China; Megatron Turing NLG [64], Microsoft & NVIDIA, USA; and HyperClova [43], Naver, Korea. This suggests that the economic incentives to build such models, and the prestige incentives to announce them, are quite strong.…”
Section: Large Language Models Are Rapidly Proliferating
confidence: 99%
“…Scaling up the amount of data, compute power, and model parameters of neural networks has recently led to the arrival (and real world deployment) of capable generative models such as CLIP [55], Ernie 3.0 Titan [70], FLAN [71], Gopher [56], GPT-3 [11], HyperClova [43], Jurassic-1-Jumbo [46], Megatron Turing NLG [64], LaMDA [68], Pan Gu [78], Yuan 1.0 [76], and more. For this class of models, the relationship between scale and model performance is often so predictable that it can be described in a lawful relationship: a scaling law.…”
Section: Introduction
confidence: 99%
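
The excerpt above refers to scaling laws: the empirical observation that loss falls off as a smooth power law in model scale. The snippet below is a hedged illustration of that idea; the constants roughly follow those reported by Kaplan et al. (2020) for parameter-count scaling and are not taken from any of the papers cited here.

# Power-law scaling of test loss with parameter count N: L(N) ~ (N_c / N) ** alpha_N.
# The constants below approximate Kaplan et al. (2020); they are illustrative only.
def power_law_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

# Predicted loss for models from 1B up to 530B parameters.
for n in (1e9, 1e10, 1e11, 5.3e11):
    print(f"{n:.1e} params -> predicted loss {power_law_loss(n):.3f}")
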
“…Comparable projects from other research groups include model libraries such as fairseq (Ott et al, 2019), large-scale parallelism libraries such as FairScale (Baines et al, 2021), and libraries that include both kinds of functionality such as DeepSpeed (Rasley et al, 2020) and Megatron (Smith et al, 2022). Some major differentiators of t5x are its use of JAX and Flax for model expression, its support for TPU (including TPU v4), and its Gin-based configuration system that allows users to modify nearly everything about the model and training procedure.…”
Section: Related Work
confidence: 99%
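
The excerpt above highlights t5x's Gin-based configuration as a differentiator. As a hedged illustration of that style, using the open-source gin-config library, the sketch below rebinds the arguments of a @gin.configurable function from a config string; the function and bindings are made up for this example, not taken from t5x itself.

import gin

@gin.configurable
def train(model_size="base", learning_rate=1e-3, train_steps=1000):
    # Placeholder training entry point; a real t5x run configures the model,
    # optimizer, data pipeline, and partitioning in the same declarative way.
    print(f"training {model_size} for {train_steps} steps at lr={learning_rate}")

gin.parse_config("""
train.model_size = 'xxl'
train.learning_rate = 1e-4
train.train_steps = 100000
""")

train()  # picks up the Gin bindings above instead of the Python defaults
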
“…This brought about modern deep-learning systems, which have revolutionized computer vision [24,29,46], natural language processing [25,55] and even biology [26]. Even in the current era of deep learning, significant network architectural breakthroughs [55] were made possible and are shown to scale with larger datasets and more computation [4,16,48,49].…”
Section: Introduction
confidence: 99%