2022
DOI: 10.48550/arXiv.2205.01068
Preprint

OPT: Open Pre-trained Transformer Language Models

Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters…
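Since the abstract emphasizes that the model weights are openly available for study, a minimal sketch of how the released checkpoints can be inspected is shown below. It assumes the weights are mirrored on the Hugging Face Hub under the facebook/opt-* identifiers and that the transformers and torch packages are installed; neither assumption comes from the abstract itself.

```python
# Minimal sketch: load the smallest OPT checkpoint and generate a continuation.
# Assumes the weights are hosted on the Hugging Face Hub as "facebook/opt-125m"
# (an assumption about distribution, not stated in the abstract above).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # smallest model in the 125M-175B suite
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```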

Cited by 180 publications (236 citation statements)
References 24 publications
“…Model Architectures: We replicate publicly available references for Transformer language model architectures [53,54]. We use the 125 million, 355 million, 1.3 billion, 2.7 billion, 6.7 billion, and 13 billion model configurations (see § A.4 for more explicit architecture and hyperparameter configurations).…”
Section: Methods (mentioning, confidence: 99%)
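The quoted passage refers to per-size architecture and hyperparameter configurations. A sketch of that configuration table is given below; the layer, hidden-size, and head counts follow commonly reported OPT settings and should be checked against Table 1 of the paper rather than taken as authoritative, and the 355M size quoted above is listed here under the OPT name "350m".

```python
# Illustrative configuration table for the six OPT sizes named in the citing work.
# Values (decoder layers, hidden size, attention heads) follow commonly reported
# OPT settings; treat them as a sketch, not an authoritative reproduction.
OPT_CONFIGS = {
    "125m": {"layers": 12, "hidden": 768,  "heads": 12},
    "350m": {"layers": 24, "hidden": 1024, "heads": 16},  # cited as "355 million"
    "1.3b": {"layers": 24, "hidden": 2048, "heads": 32},
    "2.7b": {"layers": 32, "hidden": 2560, "heads": 32},
    "6.7b": {"layers": 32, "hidden": 4096, "heads": 32},
    "13b":  {"layers": 40, "hidden": 5120, "heads": 40},
}

def head_dim(size: str) -> int:
    """Per-head dimension implied by hidden size divided by number of heads."""
    cfg = OPT_CONFIGS[size]
    return cfg["hidden"] // cfg["heads"]

print(head_dim("1.3b"))  # 64
```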
“…In this section, we lay out the details of the experiments, although we pull most training details directly from publicly available references [53,54]. As such, we provide the details of model architectures in the same style as Table 1 of [53] for ease of comparison. All models use the GELU activation [74] for the nonlinearity.…”
Section: A.4 Model Training/Dataset Details (mentioning, confidence: 99%)
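The quoted passage notes that all models use GELU for the nonlinearity. Below is a minimal sketch of the transformer feed-forward sublayer this implies, written in PyTorch; the 4x hidden expansion is an assumed GPT-style default and the class name FeedForward is hypothetical, neither taken from the quoted text.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """GELU feed-forward sublayer of a decoder-only transformer block.

    The 4x hidden expansion is an assumed GPT-style default, not something
    specified in the quoted passage.
    """

    def __init__(self, hidden_size: int, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, expansion * hidden_size)
        self.act = nn.GELU()  # GELU nonlinearity, as noted in the citation
        self.fc2 = nn.Linear(expansion * hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

# Example: a 768-wide sublayer matching the 125M-scale hidden size
ff = FeedForward(hidden_size=768)
y = ff(torch.randn(2, 16, 768))  # (batch, sequence, hidden)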
“…Triggered by GPT-3, a plethora of other large language models, each a different variant of the transformer architecture [28], have been developed. Some of the most powerful are PaLM [4], GLaM [6], Megatron-Turing NLG [23], Meta-OPT [31], Gopher [21], LaMDA [27] and Chinchilla [9]. PaLM currently provides state-of-the-art performance on NLP tasks such as natural language translation, predicting long-range text dependencies, and even translation to structured representations [4].…”
Section: Introduction (mentioning, confidence: 99%)