2021
DOI: 10.48550/arxiv.2111.06377
Preprint

Masked Autoencoders Are Scalable Vision Learners

Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens…
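
The core mechanism the abstract describes — drop a random subset of patches and encode only the visible ones — can be sketched as below. This is a minimal PyTorch illustration, not the authors' released code; the function name, the 0.75 default mask ratio, and the tensor shapes are assumptions.

import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    # tokens: (batch, num_patches, dim) sequence of patch embeddings.
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                         # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # undoes the shuffle later
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # 1 marks patches to reconstruct
    return visible, mask, ids_restore

In this reading, the encoder sees only `visible`; the decoder receives the encoded visible tokens re-inserted at their original positions (via ids_restore) plus a shared, learned mask token at every masked position, and predicts the missing pixels.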

Cited by 298 publications (945 citation statements)
References 35 publications

“…For ResNet-200, the initial number of blocks at each stage is (3, 24, 36, 3). We change it to Swin-B's (3, 3, 27, 3) at the step of changing the stage ratio.…”
Section: Modernizing ResNets: Detailed Results (mentioning)
Confidence: 99%
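
As a concrete reading of the block counts quoted above, the change might be expressed as follows; the variable names are purely illustrative, not code from the cited paper.

resnet200_blocks_per_stage = (3, 24, 36, 3)    # original ResNet-200 depth layout
swinb_style_blocks_per_stage = (3, 3, 27, 3)   # Swin-B ratio adopted at the "changing stage ratio" step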
“…4). We use the supervised training results from DeiT [68] for ViT-S/B and MAE [24] for ViT-L, as they employ improved training procedures over the original ViTs [18]. ConvNeXt models are trained with the same settings as before, but with longer warmup epochs.…”
Section: Isotropic ConvNeXt vs. ViT (mentioning)
Confidence: 99%
“…However, BYOL (Grill et al., 2020) finds that when maximizing the similarity between two augmentations of one image, negative sample pairs are not necessary. Further, SimSiam (Chen and He, 2021) finds that a momentum encoder is also not necessary, and that a stop-gradient operation applied on one side is enough for learning transferable representations.…”
Section: Jiang et al. (mentioning)
Confidence: 99%
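
A minimal sketch of the stop-gradient objective this statement describes, assuming PyTorch and SimSiam-style predictor/projector outputs (names and shapes are illustrative, not taken from the cited papers):

import torch.nn.functional as F

def simsiam_style_loss(p1, p2, z1, z2):
    # p1, p2: predictor outputs for the two augmented views;
    # z1, z2: projector outputs for the same two views.
    def neg_cosine(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()  # detach = stop-gradient
    # Symmetrized loss; no negative pairs and no momentum encoder are involved.
    return 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))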
“…Autoencoding is a classical method for representation learning [25,46], which has been outperformed by contrastive learning approaches for years. However, the recent work in this line, He et al. [18], has reclaimed state-of-the-art performance.…”
Section: Related Work (mentioning)
Confidence: 99%
“…Substantial effort has been devoted to self-supervised learning methods for 2D images [6,9,25,34,51]. Among this line, the autoencoder is one of the most classical methods [3,18,34,45,46]. Typically, it has an encoder that transforms the input into a latent code and a decoder that expands the latent code to reconstruct the input.…”
Section: Introduction (mentioning)
Confidence: 99%
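
A minimal sketch of that classical encoder/decoder structure, with arbitrary layer sizes assumed for illustration (PyTorch):

import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder: input -> latent code.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Decoder: latent code -> reconstructed input.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # latent code
        return self.decoder(z)     # reconstruction of the input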