Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3,[1] while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

[1] Following Brown et al. (2020), we use GPT-3 to refer to both the 175B model and the smaller-scale models.
[2] Exceptions include work by EleutherAI, who released dense models up to 20B parameters (Black et al., 2022), Salesforce (Nijkamp et al., 2022), and Meta AI, who released dense models up to 13B parameters and sparse models up to 1.1T parameters (Artetxe et al., 2021). There is also ongoing work from the BigScience workshop (https://bigscience.huggingface.co/), which aims to open source very large multilingual language models and datasets.
Mixture-of-Experts (MoE) layers enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ∼4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
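The conditional computation behind MoE layers means each token is processed by only a small subset of expert feed-forward networks, so total parameter count can grow far faster than per-token compute. The sketch below illustrates this with a simple top-1-routing MoE layer in PyTorch; it is not the implementation used for the models above (which rely on fairseq and different routing choices), and the class and parameter names (TopOneMoE, d_model, d_hidden, num_experts) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopOneMoE(nn.Module):
    """Illustrative mixture-of-experts feed-forward layer with top-1 routing."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary position-wise feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)       # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only tokens routed to expert e pay for its computation;
                # the gate probability scales the expert's output.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Example: route 8 tokens of width 16 through 4 experts.
layer = TopOneMoE(d_model=16, d_hidden=64, num_experts=4)
y = layer(torch.randn(8, 16))
print(y.shape)  # torch.Size([8, 16])
```

Because each token activates only one expert here, adding experts increases capacity while leaving per-token FLOPs roughly constant, which is the compute-efficiency effect the study above measures against dense baselines.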