2022
DOI: 10.48550/arxiv.2201.11990
Preprint

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Abstract: Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and finetuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest mono…
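
The abstract credits the combination of DeepSpeed and Megatron with making training at this scale feasible. As a rough, hedged illustration of how the two are typically combined for 3D parallelism (data, tensor, and pipeline), the Python sketch below shows a generic DeepSpeed setup; the model, parallel degrees, and config values are illustrative placeholders, not the settings reported for MT-NLG 530B.

import torch
import deepspeed

# Illustrative DeepSpeed config (placeholder values, not the MT-NLG settings).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},  # shard optimizer states across data-parallel ranks
}

# Stand-in for a Megatron-style transformer stack; the real model is built by
# Megatron-LM, with tensor/pipeline parallelism set via launcher flags such as
# --tensor-model-parallel-size and --pipeline-model-parallel-size.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# deepspeed.initialize wraps the model in a distributed training engine that
# handles data parallelism, ZeRO sharding, and mixed precision.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)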

Cited by 123 publications (167 citation statements)
References 30 publications (67 reference statements)
“…About one year after GPT-3 was announced, a spike in similar model announcements followed. These models were developed by both large and small private organizations from around the world: Jurassic-1-Jumbo [46], AI21 Labs, Israel; Ernie 3.0 Titan [70], Baidu, China; Gopher [56], DeepMind, USA/UK; FLAN [71] & LaMDA [68], Google, USA; Pan Gu [78], Huawei, China; Yuan 1.0 [76], Inspur, China; Megatron Turing NLG [64], Microsoft & NVIDIA, USA; and HyperClova [43], Naver, Korea. This suggests that the economic incentives to build such models, and the prestige incentives to announce them, are quite strong.…”
Section: Large Language Models Are Rapidly Proliferating
confidence: 99%
“…Scaling up the amount of data, compute power, and model parameters of neural networks has recently led to the arrival (and real world deployment) of capable generative models such as CLIP [55], Ernie 3.0 Titan [70], FLAN [71], Gopher [56], GPT-3 [11], HyperClova [43], Jurassic-1-Jumbo [46], Megatron Turing NLG [64], LaMDA [68], Pan Gu [78], Yuan 1.0 [76], and more. For this class of models, the relationship between scale and model performance is often so predictable that it can be described in a lawful relationship: a scaling law.…”
Section: Introduction
confidence: 99%
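
The excerpt above refers to scaling laws: the empirical observation that loss falls off as a smooth power law in model scale. The snippet below is a hedged illustration of that idea; the constants roughly follow those reported by Kaplan et al. (2020) for parameter-count scaling and are not taken from any of the papers cited here.

# Power-law scaling of test loss with parameter count N: L(N) ~ (N_c / N) ** alpha_N.
# The constants below approximate Kaplan et al. (2020); they are illustrative only.
def power_law_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

# Predicted loss for models from 1B up to 530B parameters.
for n in (1e9, 1e10, 1e11, 5.3e11):
    print(f"{n:.1e} params -> predicted loss {power_law_loss(n):.3f}")
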
“…Comparable projects from other research groups include model libraries such as fairseq (Ott et al, 2019), large-scale parallelism libraries such as FairScale (Baines et al, 2021), and libraries that include both kinds of functionality such as DeepSpeed (Rasley et al, 2020) and Megatron (Smith et al, 2022). Some major differentiators of t5x are its use of JAX and Flax for model expression, its support for TPU (including TPU v4), and its Gin-based configuration system that allows users to modify nearly everything about the model and training procedure.…”
Section: Related Work
confidence: 99%
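
The excerpt above highlights t5x's Gin-based configuration as a differentiator. As a hedged illustration of that style, using the open-source gin-config library, the sketch below rebinds the arguments of a @gin.configurable function from a config string; the function and bindings are made up for this example, not taken from t5x itself.

import gin

@gin.configurable
def train(model_size="base", learning_rate=1e-3, train_steps=1000):
    # Placeholder training entry point; a real t5x run configures the model,
    # optimizer, data pipeline, and partitioning in the same declarative way.
    print(f"training {model_size} for {train_steps} steps at lr={learning_rate}")

gin.parse_config("""
train.model_size = 'xxl'
train.learning_rate = 1e-4
train.train_steps = 100000
""")

train()  # picks up the Gin bindings above instead of the Python defaults
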
“…This brought about modern deep-learning systems, which have revolutionized computer vision [24,29,46], natural language processing [25,55] and even biology [26]. Even in the current era of deep learning, significant network architectural breakthroughs [55] were made possible and are shown to scale with larger datasets and more computation [4,16,48,49].…”
Section: Introduction
confidence: 99%