2021
DOI: 10.48550/arxiv.2111.09886
Preprint

SimMIM: A Simple Framework for Masked Image Modeling

Abstract: This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as blockwise masking and tokenization via discrete VAE or clustering. To study what makes the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs for each component reveal very strong representation learning performance: 1) random masking of the input image with a…
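The abstract's first design choice, simple random masking of image patches (as opposed to blockwise masking), can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' implementation; the function name `random_patch_mask` and its parameters are hypothetical:

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float, rng=None):
    """Return a boolean mask over patch indices (True = masked).

    SimMIM-style random masking: a uniformly random subset of patches
    is masked, with no blockwise structure or tokenizer involved.
    """
    rng = np.random.default_rng(rng)
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

# e.g. a 224x224 image split into 32x32 patches gives 7*7 = 49 patches
mask = random_patch_mask(49, 0.5, rng=0)
```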

Cited by 41 publications (110 citation statements)
References 45 publications
“…Masking Strategy and Masking Ratio. As shown in Table 4, we observe CIM works better with simple random masking (He et al, 2021;Xie et al, 2021) compared with the blockwise masking strategy proposed in BEiT.…”
Section: Ablation Studies
confidence: 79%
“…(MIM; Bao et al, 2021), which randomly masks out some input tokens and then recovers the masked content by conditioning on the visible context, is able to learn rich visual representations and shows promising performance on various vision benchmarks (Zhou et al, 2021;He et al, 2021;Xie et al, 2021;Dong et al, 2021;Wei et al, 2021;El-Nouby et al, 2021).…”
Section: Introduction
confidence: 99%
“…In this work, we found that our proposed MIM is more effective than MLM. Inspired by recent works of self-supervised learning on vision [12,42], we propose to mask out image patches with larger proportion and follow MaskFeat [41] to reconstruct other views of the whole image rather than recovering those masked regions only.…”
Section: Related Work
confidence: 99%
“…In this work, we disregard the reconstruction of each masked region, but instead recover the holistic image signal at v cls token. We first follow [12,42] to use a larger masking ratio of 50% (instead of 15% as in [5,25,36]). The masked patches are replaced with grey pixels.…”
Section: Image-text Pre-training
confidence: 99%
“…Recent advancements in self-supervised representation learning show masked image modeling (MIM) [3,24,8,11] as an effective pre-training strategy for the Vision Transformer (ViT) [7], which is powerful yet hard to train because of lack of inductive bias. The basic idea of MIM is masking and reconstructing: masking a set of image patches before input into the transformer and reconstructing these masked patches at the output.…”
Section: Introduction
confidence: 99%
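The "masking and reconstructing" idea described in the citation statements above (mask a subset of patches, replace them with a fixed grey value, and compute the reconstruction loss on masked positions only) can be sketched as follows. This is a minimal NumPy sketch under those stated assumptions, not any cited paper's code; `predict` is a hypothetical stand-in for the encoder plus prediction head:

```python
import numpy as np

def mim_loss(patches, mask, predict):
    """Sketch of a masked-image-modeling objective.

    patches: (N, D) array of flattened patch pixel values in [0, 1].
    mask:    (N,) boolean array, True = masked patch.
    predict: callable mapping corrupted patches to reconstructions
             (stand-in for the transformer encoder + head).
    """
    corrupted = patches.copy()
    corrupted[mask] = 0.5                      # grey-pixel replacement
    recon = predict(corrupted)
    # L1 reconstruction loss, averaged over masked positions only
    return np.abs(recon[mask] - patches[mask]).mean()
```

With an identity `predict`, the loss reduces to the mean distance between the grey value and the true masked pixels, which makes the objective easy to sanity-check.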