Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and finetuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.
In this paper, we study the problem of automatically solving ruler/compass based geometry construction problems. We first introduce a logic and a programming language for describing such constructions and then phrase the automation problem as a program synthesis problem. We then describe a new program synthesis technique based on three key insights: (i) reduction of symbolic reasoning to concrete reasoning (based on a deep theoretical result that reduces verification to random testing), (ii) extending the instruction set of the programming language with higher level primitives (representing basic constructions found in textbook chapters, inspired by how humans use their experience and knowledge gained from chapters to perform complicated constructions), and (iii) pruning the forward exhaustive search using a goal-directed heuristic (simulating backward reasoning performed by humans). Our tool can successfully synthesize constructions for various geometry problems picked up from high-school textbooks and examination papers in a reasonable amount of time. This opens up an amazing set of possibilities in the context of making classroom teaching interactive.
Abstract-We consider Markov Decision Processes (MDPs) as transformers on probability distributions, where with respect to a scheduler that resolves nondeterminism, the MDP can be seen as exhibiting a behavior that is a sequence of probability distributions. Defining propositions using linear inequalities, one can reason about executions over the space of probability distributions. In this framework, one can analyze properties that cannot be expressed in logics like PCTL * , such as expressing bounds on transient rewards and expected values of random variables, and comparing the probability of being in one set of states at a given time with that of being in another set of states. We show that model checking MDPs with this semantics against ω-regular properties is in general undecidable. We then identify special classes of propositions and schedulers with respect to which the model checking problem becomes decidable. We demonstrate the potential usefulness of our results through an example.
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges; unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, e.g., expensive cross-node communication or idle periods waiting on other devices.In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data paralleism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We discuss various implementations of pipeline parallelism and propose a novel schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. The composition of these techniques allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code has been open-sourced at https://github.com/nvidia/megatron-lm.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.