Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2021
DOI: 10.1145/3458817.3476209
Efficient large-scale language model training on GPU clusters using Megatron-LM

Cited by 202 publications (191 citation statements). References 3 publications.
“…Religion Analysis: For religion, we perform an analysis of the co-occurring words similar to gender and ethnicity. The three prompts used for religion are "The {religious practitioners} are", "The {religious practitio…
Atheism: belief (20), think (40), science (43), lack (53), reason (54), preach (62), existence (63), thinking (76), angry (80), human (81)
Buddhism: compassion (13), mindfulness (15), Buddha (17), monk (21), mind (23), robes (24), calm (30), peaceful (32), living (44), chanting (46)
Christianity: Christ (16), Jesus (17), bible (34), told (45), forced (69), families (73), giving (74), charity (77), poor (82), churches (86)
Hinduism: yoga (11), India (14), tolerance (23), caste (44), traditions (46), Indian (50), system (59), husband (60), skin (68), respect (72)
Islam: hijab (11), modesty (27), prophet (34), law (35), cover (47), Allah (55), face (57), mosque (59), countries (65), veil (67)
Judaism: Jewish (8), white…”
Section: Male Identifiers (mentioning)
confidence: 99%
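A minimal sketch (not the cited study's code) of how such a co-occurrence analysis can be run: sample completions for each religion prompt, count non-stopword tokens, and rank them by frequency, which yields "word (rank)" pairs like those listed above. The function sample_completions is a hypothetical placeholder for whatever language model produced the completions.

from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "that", "they"}

def sample_completions(prompt: str, n: int = 100) -> list[str]:
    """Hypothetical stand-in for drawing n completions from a language model."""
    raise NotImplementedError

def cooccurrence_ranking(group: str, templates: list[str], n: int = 100) -> dict[str, int]:
    """Map each non-stopword to its frequency rank across completions for one group."""
    counts = Counter()
    for template in templates:
        for text in sample_completions(template.format(group), n=n):
            words = re.findall(r"[a-z']+", text.lower())
            counts.update(w for w in words if w not in STOPWORDS)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Usage sketch: ranks = cooccurrence_ranking("Buddhists", ["The {} are"]); ranks.get("compassion")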
“…Training MT-NLG was made feasible by numerous innovations and breakthroughs along all AI axes. Through a collaboration between NVIDIA Megatron-LM [63,43] and Microsoft DeepSpeed [57,65], we created an efficient and scalable 3D parallel system capable of combining data, pipeline, and tensor-slicing based parallelism. By combining tensor-slicing and pipeline parallelism, we can operate within the regime where they are most effective.…”
Section: Introduction (mentioning)
confidence: 99%
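The 3D decomposition described in this excerpt can be illustrated by how a flat GPU rank maps onto tensor-, pipeline-, and data-parallel coordinates. The sketch below assumes one particular rank-ordering convention chosen for illustration; it is not Megatron-LM's or DeepSpeed's actual API.

def rank_to_3d_coords(rank: int, tensor_parallel: int, pipeline_parallel: int, world_size: int):
    """Map a flat GPU rank to (tensor, pipeline, data) coordinates in a 3D-parallel grid."""
    assert world_size % (tensor_parallel * pipeline_parallel) == 0
    data_parallel = world_size // (tensor_parallel * pipeline_parallel)
    # Assumed convention: tensor-parallel ranks are adjacent, then pipeline stages,
    # then data-parallel replicas.
    tp = rank % tensor_parallel
    pp = (rank // tensor_parallel) % pipeline_parallel
    dp = rank // (tensor_parallel * pipeline_parallel)
    return {"tensor": tp, "pipeline": pp, "data": dp, "data_parallel_size": data_parallel}

# Example: 16 GPUs with tensor_parallel=2 and pipeline_parallel=4 leave a data-parallel
# degree of 2; rank 5 lands at tensor=1, pipeline=2, data=0.
print(rank_to_3d_coords(5, tensor_parallel=2, pipeline_parallel=4, world_size=16))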
“…However, unlike CoCoNet, PyTorch's DDP requires extra memory for overlapping, which can increase training time for very large models [9], and it does not support slicing of the optimizer parameter update, which significantly decreases memory usage. GPipe [26], PipeDream [38], and Narayanan et al. [39] proposed pipeline training to improve model parallelism, dividing each mini-batch into several micro-batches whose forward and backward passes are then pipelined across devices. vPipe [53] improves on these works by providing higher GPU utilization.…”
Section: Related Work (mentioning)
confidence: 99%
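The micro-batch pipelining idea in the excerpt above can be sketched as a simple fill-and-drain (GPipe-style) schedule: forward passes of the micro-batches flow through the stages in order, then backward passes drain in reverse stage order. This is an illustrative simulation of the schedule only, not the implementation of GPipe, PipeDream, or the other cited systems.

def gpipe_schedule(num_stages: int, num_microbatches: int):
    """Return, per time step, the (stage, phase, microbatch) work items of a fill-drain schedule."""
    timeline = []
    # Forward fill: stage s runs micro-batch m's forward pass at time t = s + m.
    for t in range(num_stages + num_microbatches - 1):
        timeline.append([(s, "F", t - s) for s in range(num_stages)
                         if 0 <= t - s < num_microbatches])
    # Backward drain: the last stage starts, earlier stages follow in reverse order.
    for t in range(num_stages + num_microbatches - 1):
        timeline.append([(s, "B", t - (num_stages - 1 - s)) for s in range(num_stages)
                         if 0 <= t - (num_stages - 1 - s) < num_microbatches])
    return timeline

# With 4 stages and 8 micro-batches, the "bubble" is the idle time during fill and
# drain; it shrinks relative to useful work as the number of micro-batches grows.
for t, step in enumerate(gpipe_schedule(4, 8)):
    print(t, step)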
“…Transformer models Vaswani et al. [2017] have attracted increasing interest and shown excellent performance in domains such as natural language processing (NLP) Vaswani et al. [2017], Devlin et al. [2019], Radford et al. [2019], vision Dosovitskiy et al. [2021], and graphs Ying et al. [2021], Yun et al. [2019]. Yet, their typically very high complexity (up to billions of parameters Narayanan et al. [2021]) makes these models notoriously opaque and their predictions inaccessible to the user. Since Transformer models have heavy application in potentially sensitive domains, e.g.…”
Section: Introduction (mentioning)
confidence: 99%
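To make the "billions of parameters" scale cited above concrete, here is a rough back-of-the-envelope count for a GPT-style Transformer, keeping only the dominant terms (attention and MLP weights per layer plus the embedding matrix). The example configuration is illustrative, not taken from the excerpt.

def approx_transformer_params(num_layers: int, hidden_size: int, vocab_size: int) -> int:
    """Approximate parameter count for a GPT-style Transformer (dominant terms only)."""
    per_layer = 12 * hidden_size ** 2        # QKV + output projection (~4h^2) and 4h-wide MLP (~8h^2)
    embeddings = vocab_size * hidden_size    # token embedding (often tied with the output layer)
    return num_layers * per_layer + embeddings

# Example: a GPT-3-scale configuration (96 layers, hidden size 12288, ~50k vocabulary)
# comes out around 1.75e11 parameters, i.e. on the order of 175 billion.
print(approx_transformer_params(96, 12288, 50257))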