2021
DOI: 10.48550/arxiv.2112.10740
Preprint

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?



Cited by 28 publications (40 citation statements: 3 supporting, 37 mentioning, 0 contrasting)
References: 0 publications
“…Some recent approaches have started to explore the combination of joint-embedding architectures and denoising pre-training tasks (El-Nouby et al, 2021;Baevski et al, 2022;Zhou et al, 2021). Those approaches mask an image by replacing the masked patches with a learnable mask token, and output a single vector for each masked patch.…”
Section: Related Work
mentioning
confidence: 99%
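For a concrete picture of the masking scheme this statement describes, here is a minimal PyTorch-style sketch: masked patch embeddings are replaced with a shared learnable mask token so the encoder can emit one vector per masked patch. The tensor shapes, the 0.6 mask ratio, and the variable names are illustrative assumptions, not code from any of the cited papers.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: replace a random subset of patch embeddings with a
# shared learnable [MASK] token, so the encoder outputs one vector per masked
# patch. Shapes and the mask ratio are assumed values.
batch, num_patches, embed_dim = 2, 196, 768
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable mask token

patches = torch.randn(batch, num_patches, embed_dim)      # patch embeddings
mask = torch.rand(batch, num_patches) < 0.6               # True = masked patch

# Substitute the mask token at every masked position
tokens = torch.where(mask.unsqueeze(-1),
                     mask_token.expand(batch, num_patches, embed_dim),
                     patches)
# A ViT encoder would now process `tokens` and return one vector per patch,
# including one for each masked position.
```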
“…loss, iBOT (Zhou et al, 2021) and SplitMask (El-Nouby et al, 2021) apply a joint-embedding loss to an output representing the global sequence (either the [CLS] token or a global average pool of the patch vectors). SplitMask shows that by using a patch-level loss, you can reduce the amount of unlabeled pre-training data.…”
mentioning
confidence: 99%
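To make the distinction in this statement concrete, the hedged sketch below contrasts a joint-embedding loss applied to a single global vector per image (a mean pool standing in for the [CLS] token) with a patch-level loss applied to every patch vector. The negative-cosine criterion and the shapes are assumptions for illustration, not the exact losses used by iBOT or SplitMask.

```python
import torch
import torch.nn.functional as F

# Global loss: one vector per image (mean pool as a stand-in for [CLS]).
def global_loss(student, teacher):
    s, t = student.mean(dim=1), teacher.mean(dim=1)
    return -F.cosine_similarity(s, t, dim=-1).mean()

# Patch-level loss: one term per patch position, averaged over all patches.
def patch_level_loss(student, teacher):
    return -F.cosine_similarity(student, teacher, dim=-1).mean()

student = torch.randn(2, 196, 768)   # student patch vectors
teacher = torch.randn(2, 196, 768)   # teacher (target) patch vectors
print(global_loss(student, teacher).item(), patch_level_loss(student, teacher).item())
```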
“…The method of DABS [63] also uses the idea of patch misplacement, but it does not have a way to handle degenerate learning and it does not show performance improvements. A technique that has proven very beneficial to improve the training efficiency of vision transformers is token dropping [1,36,25,10]. We extend this technique by randomizing the token dropping amount and including the case of no dropping to narrow the domain gap between pre-training and transfer.…”
Section: Related Work
mentioning
confidence: 99%
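A hedged sketch of the randomized token dropping this statement describes, where the drop ratio is sampled per batch and can be close to zero so the no-drop case is included. The function name and the uniform sampling of the ratio are assumptions for illustration, not the cited implementation.

```python
import torch

def drop_tokens(patch_tokens: torch.Tensor, max_drop_ratio: float = 0.75) -> torch.Tensor:
    """Keep a random subset of patch tokens; the drop ratio itself is sampled
    uniformly in [0, max_drop_ratio], so sometimes no tokens are dropped."""
    batch, num_patches, dim = patch_tokens.shape
    drop_ratio = torch.rand(()).item() * max_drop_ratio    # may be ~0 (no dropping)
    num_keep = num_patches - int(drop_ratio * num_patches)
    noise = torch.rand(batch, num_patches)                 # random per-image ordering
    keep_idx = noise.argsort(dim=1)[:, :num_keep]          # indices of kept tokens
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    return torch.gather(patch_tokens, dim=1, index=keep_idx)

kept = drop_tokens(torch.randn(2, 196, 768))
print(kept.shape)   # (2, num_keep, 768)
```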
“…(MIM; Bao et al, 2021), which randomly masks out some input tokens and then recovers the masked content by conditioning on the visible context, is able to learn rich visual representations and shows promising performance on various vision benchmarks (Zhou et al, 2021;He et al, 2021;Xie et al, 2021;Dong et al, 2021;Wei et al, 2021;El-Nouby et al, 2021).…”
Section: Introduction
mentioning
confidence: 99%
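As an illustration of the masked-image-modeling recipe summarized in this statement, the sketch below masks random patch tokens, encodes the partially masked sequence with a small transformer, and regresses the original patch content at masked positions only. The component sizes, the tiny encoder, and the pixel-regression target are assumptions for illustration, not the setup of BEiT, MAE, or the other cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, num_patches, patch_dim, embed_dim = 2, 196, 16 * 16 * 3, 192

patch_embed = nn.Linear(patch_dim, embed_dim)              # embed flattened 16x16 RGB patches
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))    # learnable mask token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2)
decoder = nn.Linear(embed_dim, patch_dim)                  # predict raw patch pixels

patches = torch.randn(batch, num_patches, patch_dim)       # ground-truth patch content
tokens = patch_embed(patches)
mask = torch.rand(batch, num_patches) < 0.4                # True = masked token

# Mask out random tokens, then encode the full (partially masked) sequence.
tokens = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
pred = decoder(encoder(tokens))

# Recover masked content by conditioning on the visible context:
# the loss is computed over masked positions only.
loss = F.mse_loss(pred[mask], patches[mask])
loss.backward()
```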