2022
DOI: 10.48550/arxiv.2205.13137
Preprint
MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

Abstract: In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing the original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down training and causes training-finetuning inconsistency, due to the large masking ratio (e.g., 40% in BEiT). In contra…
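The [MASK]-free corruption the abstract describes — filling the masked positions of one image with the visible tokens of another — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `mix_tokens`, the array shapes, and the 0/1 mask convention are assumptions.

```python
import numpy as np

def mix_tokens(tokens_a, tokens_b, mask):
    """Build one mixed token sequence from two images.

    tokens_a, tokens_b: (N, D) patch-token arrays for two images.
    mask: (N,) binary array; 1 marks a position masked for image A.
    Positions masked in A take B's tokens, and vice versa, so no
    special [MASK] embedding ever enters the encoder.
    """
    m = mask[:, None].astype(tokens_a.dtype)
    return tokens_a * (1 - m) + tokens_b * m

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))   # patch tokens of image A
b = rng.normal(size=(4, 8))   # patch tokens of image B
mask = np.array([1, 0, 1, 0]) # which positions are masked for A
mixed = mix_tokens(a, b, mask)
```

Each image would then be reconstructed from the mixed sequence, so every input token is a real image token, which is what avoids the training-finetuning inconsistency the abstract attributes to [MASK] symbols.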

Cited by 12 publications (27 citation statements). References 27 publications.
“…SimMIM [45] Swin-B RGB 100% 60% 800 84.0 MCMAE [11] CViT-B RGB 25% 75% 1600 85.0 MixMIM [29] MixMIM-B RGB 100% 100% 600 85.1 CMAE [19] CViT-B RGB 25% 75% 1600 85.3…”
Section: Methods
confidence: 99%
“…Specifically, during pretraining, we partition the training images into patches and feed a portion of them into the encoder following Masked Autoencoder [14]. Our GeoMIM decoder then uses these encoded visible tokens to reconstruct the pretrained LiDAR model's BEV feature in the BEV space, instead of the commonly used RGB pixels [47,14,30] or depth points [3] of existing MAE frameworks. To achieve this PV-to-BEV reconstruction, we first devise two branches to decouple the semantic and geometric parts, with one branch completing dense PV features and the other reconstructing the depth map.…”
Section: Pretrain Supervision Finetune
confidence: 99%
“…In the past few years, contrastive learning has been very popular; it aims to learn invariances from different augmented views of images [4,12,16]. Recently, Masked Image Modeling (MIM) [11,15,28,40] has become increasingly prevalent for vision transformers. MIM is the task of reconstructing image content from a masked image.…”
Section: Related Work
confidence: 99%
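The generic MIM corruption step this excerpt refers to — hiding a random subset of patch tokens and asking the model to reconstruct them — can be sketched as below. This is a hedged sketch, not any particular paper's code: `random_masking`, the `mask_ratio` argument, and the zero vector standing in for a learnable [MASK] embedding are all assumptions.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Corrupt a token sequence for masked image modeling.

    tokens: (N, D) patch-token array.
    mask_ratio: fraction of tokens to hide (e.g. 0.4 as in BEiT).
    Returns (corrupted, mask); mask[i] = True means token i must be
    reconstructed from the corrupted input.
    """
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.permutation(n)[:n_mask]   # random subset of positions
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    corrupted = tokens.copy()
    corrupted[mask] = 0.0  # stand-in for a learnable [MASK] embedding
    return corrupted, mask

tokens = np.ones((10, 4))
corrupted, mask = random_masking(tokens, 0.4, np.random.default_rng(1))
```

The reconstruction loss would then be computed only at the masked positions, comparing the decoder's output against the original tokens.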