2022
DOI: 10.48550/arXiv.2205.13543
Preprint

Revealing the Dark Secrets of Masked Image Modeling

Abstract: Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, the visualizations and the experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, but supervised models tend to focus locally at lower layers…
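The locality claim in the abstract is usually quantified with the averaged attention distance of each head. Below is a minimal sketch of that metric for a ViT-style attention map; the function name and tensor layout are assumptions for illustration, not the authors' released code.

```python
import torch

def avg_attention_distance(attn, grid_size, patch_size=16):
    """Average spatial distance (in pixels) each attention head attends
    over. `attn` is a (heads, N, N) attention map over N = grid_size**2
    patch tokens (CLS token removed); small values mean local attention.
    Hypothetical helper, not the paper's code."""
    gh = gw = grid_size
    ys, xs = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)

    # Pairwise Euclidean distances between patch centers, in pixels.
    dist = torch.cdist(coords, coords) * patch_size                     # (N, N)

    # Attention-weighted expected distance per query, averaged over queries.
    return (attn * dist.unsqueeze(0)).sum(dim=-1).mean(dim=-1)          # (heads,)
```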

Cited by 7 publications (7 citation statements)
References 68 publications

“…Hence, VoxFormer with stereo depth performs best. Note that our framework can be integrated with any state-of-the-art depth models, so using a stronger existing depth predictor [74][75][76] could enhance our SSC performance. Meanwhile, VoxFormer can be further promoted along with the advancement of depth estimation.…”
Section: Ablation Studies (mentioning; confidence: 99%)
“…In this work, we develop our pretraining objective based on a masked image modeling approach like [20,44]. MIM has recently been shown to be particularly effective in the natural image domain, surpassing many contrastive works and being shown to be friendlier to downstream optimization [3,20,43,44,49]. Exploration of the masked image modeling framework in geospatial applications is still in its early stages, and could help alleviate some concerns with contrastive approaches in this domain.…”
Section: Geospatial Pretraining (mentioning; confidence: 99%)
“…Mixing-up (Wickstrøm et al, 2022) exploits a data augmentation scheme in which new samples are generated by mixing two data samples and the model is optimized to predict the mixing weights. Note that contrastive learning mainly focuses on the high-level information (Xie et al, 2022a) and the series-wise or patch-wise representations inherently mismatch the low-level tasks, such as time series forecasting. Thus, in this paper, we focus on the masked modeling paradigm.…”
Section: Related Work (mentioning; confidence: 99%)
“…Masked modeling has been explored in stacked denoising autoencoders (Vincent et al, 2010), where the masking is viewed as adding noise to the original data and the masked modeling is to project the masked data from the neighborhood back to the original manifold, namely denoising. Recently, it has been widely used in pre-training, where it can learn valuable low-level information from data without supervision (Xie et al, 2022a). Inspired by the manifold perspective, we go beyond the classical denoising process and project the masked data back to the manifold by aggregating multiple masked time series within the neighborhood.…”
Section: Understanding Masked Modeling (mentioning; confidence: 99%)