2022
DOI: 10.48550/arxiv.2203.15556
Preprint

Training Compute-Optimal Large Language Models

Abstract: We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of …
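The abstract's headline result is that model size and training tokens should grow in equal proportion with the compute budget. A minimal sketch of that allocation rule is below, assuming the commonly used approximation C ≈ 6·N·D for training FLOPs and the roughly 20-tokens-per-parameter ratio reported for compute-optimal models; the function name and constants here are illustrative, not the paper's code.

```python
# Minimal sketch (not the authors' code): split a FLOP budget C between
# parameters N and training tokens D, assuming C ≈ 6 * N * D and the
# paper's finding that N and D should scale in equal proportion,
# with roughly 20 tokens per parameter at the compute-optimal point.

import math

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio (assumption)


def compute_optimal_allocation(flop_budget: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) for a given FLOP budget,
    solving C = 6 * N * D with D = TOKENS_PER_PARAM * N."""
    n_params = math.sqrt(flop_budget / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Roughly Chinchilla's budget: 6 * 70e9 params * 1.4e12 tokens of compute.
    budget = 6 * 70e9 * 1.4e12
    n, d = compute_optimal_allocation(budget)
    print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
    # -> ~70B parameters, ~1.4T tokens
```

Because both N and D scale as the square root of C under this rule, a 10× larger compute budget buys roughly 3.2× more parameters and 3.2× more tokens, rather than being spent almost entirely on model size.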

Cited by 167 publications (256 citation statements)
References 25 publications
“…This trend has been justified by the findings of Kaplan et al (2020), who show that language modelling performance is strongly correlated with model size. Recently, Hoffmann et al (2022) have refined these findings, showing that the number of data tokens should scale at the same rate as the model size to maximise computational efficiency. Based on these findings, they introduced the Chinchilla family of models, which we build upon, using the 70B parameter Chinchilla model as the base LM for our largest Flamingo model.…”
Section: Language Modelling
confidence: 96%
“…(b) Architectural innovations and training strategies that effectively leverage large pretrained vision-only and language-only models, preserving the benefits of these initial models while efficiently fusing the modalities. Starting from Chinchilla, a 70B state-of-the-art LM (Hoffmann et al, 2022), we train Flamingo, an 80B parameter VLM. (c) Efficient ways to adapt to visual inputs of varying size, making Flamingo applicable to images and videos.…”
Section: Contributions
confidence: 99%