2021
DOI: 10.48550/arxiv.2110.02095
Preprint

Exploring the Limits of Large Scale Pre-training

Samira Abnar,
Mostafa Dehghani,
Behnam Neyshabur
et al.

Abstract: Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with number of parameters …
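The saturation effect described in the abstract can be pictured with a small curve-fitting sketch. Everything below is assumed for illustration: the (upstream, downstream) accuracy pairs and the saturating_fit form are hypothetical placeholders, not the functional form or measurements from the paper.

```python
# Minimal sketch (not the paper's exact model): fit a saturating curve to
# hypothetical (upstream accuracy, downstream accuracy) pairs to visualize
# how downstream gains flatten as upstream accuracy keeps improving.
import numpy as np
from scipy.optimize import curve_fit

def saturating_fit(x, c, k, alpha):
    # Downstream accuracy approaches an asymptote c as upstream accuracy x -> 1;
    # k and alpha control how quickly the curve bends toward saturation.
    return c - k * (1.0 - x) ** alpha

# Hypothetical data points standing in for per-task measurements.
upstream = np.array([0.60, 0.70, 0.78, 0.84, 0.88, 0.90])
downstream = np.array([0.55, 0.68, 0.75, 0.79, 0.81, 0.815])

params, _ = curve_fit(saturating_fit, upstream, downstream, p0=[0.85, 0.5, 1.0])
print("fitted downstream asymptote (ceiling):", params[0])
```

A fitted curve that hugs its asymptote over the observed upstream range is the signature of the saturation the abstract reports: further upstream gains buy progressively less downstream accuracy.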

Cited by 15 publications (40 citation statements)
References 38 publications
“…Our study builds on [49], which trained convolutional networks on billions of images to predict associated hashtags. Compared to [49], our study: (1) trains larger models with more efficient convolutional and transformer architectures on a much larger dataset, (2) studies the performance of the resulting models in zero-shot transfer settings in addition to standard transfer-learning experiments, (3) performs comparisons of our models with state-of-the-art self-supervised learners, and (4) presents an in-depth study of potential harmful associations that models may adopt from the weak supervision they receive. Despite the conceptual similarities in our approach, our best model achieves an ImageNet-1K validation accuracy that is more than 3% higher than that reported in [49].…”
Section: Related Work (mentioning)
confidence: 83%
“…Some recent studies have also used the much larger JFT-300M [20] and JFT-3B [76] image datasets, but not much is known publicly about those datasets. The effectiveness of supervised pre-training has been the subject of a number of studies, in particular, [1,42,60] analyze the transfer performance of supervised pre-trained models. Self-supervised pre-training has seen tremendous progress in recent years.…”
Section: Related Work (mentioning)
confidence: 99%
“…Another interesting insight is how the scaling law of DSI differs from Dual Encoders. Understanding the scaling behaviour of Transformers has garnered significant interest in recent years (Kaplan et al., 2020; Tay et al., 2021; Abnar et al., 2021). We find that the gain in retrieval performance obtained from increasing model parameterization in DE seems to be relatively small.…”
Section: Scaling Laws (mentioning)
confidence: 85%
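The scaling-law comparison in the statement above can be pictured with a simple power-law fit of performance against parameter count. This is a sketch under assumptions: the parameter counts, the scores, the two model families, and the fit_power_law helper are hypothetical and do not reproduce numbers from the cited work.

```python
# Illustrative only: fit performance ~ a * N^b for two hypothetical model
# families to compare how steeply each benefits from added parameters.
import numpy as np

params_count = np.array([1e7, 1e8, 1e9, 1e10])      # hypothetical model sizes
dsi_scores = np.array([0.30, 0.42, 0.55, 0.66])     # hypothetical DSI-like curve
de_scores = np.array([0.40, 0.44, 0.47, 0.49])      # hypothetical dual-encoder-like curve

def fit_power_law(n, y):
    # Linear regression in log space: log y = log a + b * log n.
    b, log_a = np.polyfit(np.log(n), np.log(y), 1)
    return np.exp(log_a), b

for name, scores in [("DSI-like", dsi_scores), ("DE-like", de_scores)]:
    a, b = fit_power_law(params_count, scores)
    print(f"{name}: exponent b = {b:.3f} (larger b -> bigger gain from scale)")
```

A smaller fitted exponent for the dual-encoder-like points corresponds to the observation quoted above that its retrieval gains from added parameters are comparatively modest.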