2021
DOI: 10.48550/arxiv.2111.09886
Preprint

SimMIM: A Simple Framework for Masked Image Modeling

Abstract: This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as blockwise masking and tokenization via discrete VAE or clustering. To study what makes the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs for each component reveal very strong representation learning performance: 1) random masking of the input image with a…
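The abstract's first design choice, simple random masking of image patches (as opposed to blockwise masking), can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' implementation; the function name `random_patch_mask` and its parameters are hypothetical:

```python
import numpy as np

def random_patch_mask(num_patches: int, mask_ratio: float, rng=None):
    """Return a boolean mask over patch indices (True = masked).

    SimMIM-style random masking: a uniformly random subset of patches
    is masked, with no blockwise structure or tokenizer involved.
    """
    rng = np.random.default_rng(rng)
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

# e.g. a 224x224 image split into 32x32 patches gives 7*7 = 49 patches
mask = random_patch_mask(49, 0.5, rng=0)
```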

Cited by 41 publications (110 citation statements)
References 45 publications
“…Masking Strategy and Masking Ratio. As shown in Table 4, we observe CIM works better with simple random masking (He et al, 2021;Xie et al, 2021) compared with the blockwise masking strategy proposed in BEiT.…”
Section: Ablation Studies
confidence: 79%
“…(MIM; Bao et al, 2021), which randomly masks out some input tokens and then recovers the masked content by conditioning on the visible context, is able to learn rich visual representations and shows promising performance on various vision benchmarks (Zhou et al, 2021;He et al, 2021;Xie et al, 2021;Dong et al, 2021;Wei et al, 2021;El-Nouby et al, 2021).…”
Section: Introduction
confidence: 99%
“…In this work, we found that our proposed MIM is more effective than MLM. Inspired by recent works of self-supervised learning on vision [12,42], we propose to mask out image patches with larger proportion and follow MaskFeat [41] to reconstruct other views of the whole image rather than recovering those masked regions only.…”
Section: Related Work
confidence: 99%
“…In this work, we disregard the reconstruction of each masked region, but instead recover the holistic image signal at v cls token. We first follow [12,42] to use a larger masking ratio of 50% (instead of 15% as in [5,25,36]). The masked patches are replaced with grey pixels.…”
Section: Image-text Pre-training
confidence: 99%
“…Recent advancements in self-supervised representation learning show masked image modeling (MIM) [3,24,8,11] as an effective pre-training strategy for the Vision Transformer (ViT) [7], which is powerful yet hard to train because of lack of inductive bias. The basic idea of MIM is masking and reconstructing: masking a set of image patches before input into the transformer and reconstructing these masked patches at the output.…”
Section: Introduction
confidence: 99%
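The "masking and reconstructing" idea described in the citation statements above (mask a subset of patches, replace them with a fixed grey value, and compute the reconstruction loss on masked positions only) can be sketched as follows. This is a minimal NumPy sketch under those stated assumptions, not any cited paper's code; `predict` is a hypothetical stand-in for the encoder plus prediction head:

```python
import numpy as np

def mim_loss(patches, mask, predict):
    """Sketch of a masked-image-modeling objective.

    patches: (N, D) array of flattened patch pixel values in [0, 1].
    mask:    (N,) boolean array, True = masked patch.
    predict: callable mapping corrupted patches to reconstructions
             (stand-in for the transformer encoder + head).
    """
    corrupted = patches.copy()
    corrupted[mask] = 0.5                      # grey-pixel replacement
    recon = predict(corrupted)
    # L1 reconstruction loss, averaged over masked positions only
    return np.abs(recon[mask] - patches[mask]).mean()
```

With an identity `predict`, the loss reduces to the mean distance between the grey value and the true masked pixels, which makes the objective easy to sanity-check.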