2022
DOI: 10.48550/arxiv.2202.03026
Preprint

Context Autoencoder for Self-Supervised Representation Learning

Cited by 40 publications (84 citation statements)
References 0 publications

“…DINO [3], ViT-B/16, 300, 44.1; BEIT [2], ViT-B/16, 800, 45.6; MAE [16], ViT-B/16, 1600, 48.1; CAE [5], ViT-B/16, 800, 48.8; PeCo [12], ViT… It is admitted that the above significant improvements of our MVP could be attributed to the super-large-scale multimodal dataset used in pre-training CLIP. To validate this, we also conduct comparisons with the BEIT model pre-trained on ImageNet-21K, which contains about 21K classes.…”
Section: Methods (mentioning)
confidence: 97%
“…Moreover, with the pixel-level reconstruction of each masked patch, MAE [16] further improved the final results. Concurrently, several similar MIM-based schemes [37,34,5] have been proposed and have pushed forward the development of visual pre-training. In this work, we also adopt the MIM-based framework but design a dedicated multimodality-driven pretext task to guide the visual models to learn more multimodal semantic knowledge.…”
Section: Visual Pre-training (mentioning)
confidence: 99%
“…On the pixel level, MAE [18] and SimMIM simply mask the pixels in the patches and then predict them, encouraging the model to focus on the semantics. Similarly, CiM [14] and CAE [6] pursue the same goal but with more sophisticated structure designs. However, some useful local details may be lost in these methods.…”
Section: Visual Pre-training (mentioning)
confidence: 99%
“…masked image modeling (MIM) [6,12,18,49,52,63] exhibits promising potential; it inherits the "mask-and-reconstruct" idea from masked autoencoding methods in the natural language processing (NLP) field, such as BERT [11]. More concretely, parts of the content in an input image are masked; the encoder learns latent representations from the visible regions, which the decoder then uses to reconstruct the masked content.…”
Section: Introduction (mentioning)
confidence: 99%
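For readers unfamiliar with the "mask-and-reconstruct" pipeline described in the excerpt above, the following is a minimal sketch. The shapes, masking ratio, and the toy linear stand-ins for the transformer encoder and decoder are illustrative assumptions, not the design of CAE or any specific cited paper.

```python
# Minimal sketch of masked image modeling: mask patches, encode the visible
# ones, and compute a reconstruction loss on the masked ones.
# All shapes and the linear "encoder"/"decoder" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=16):
    """Split an HxWxC image into rows of flattened patch*patch*C patches."""
    h, w, c = img.shape
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = rng.random((224, 224, 3)).astype(np.float32)
patches = patchify(img)                              # (196, 768)
num_patches, patch_dim = patches.shape

# Randomly mask a fraction of the patches; only the visible ones are encoded.
mask_ratio = 0.75
perm = rng.permutation(num_patches)
num_visible = int(num_patches * (1 - mask_ratio))
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

# Toy linear maps standing in for the transformer encoder and decoder.
dim = 128
W_enc = rng.normal(scale=0.02, size=(patch_dim, dim))
W_dec = rng.normal(scale=0.02, size=(dim, patch_dim))

latent = patches[visible_idx] @ W_enc                  # encode visible regions only
context = latent.mean(axis=0)                          # crude stand-in for the decoder context
pred = np.tile(context @ W_dec, (len(masked_idx), 1))  # predict the masked patches

# The reconstruction loss is computed on the masked patches.
loss = np.mean((pred - patches[masked_idx]) ** 2)
print(f"masked-patch reconstruction MSE: {loss:.4f}")
```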
“…The method of DABS [63] also uses the idea of patch misplacement, but it does not have a way to handle degenerate learning and does not show performance improvements. A technique that has proven very beneficial for improving the training efficiency of vision transformers is token dropping [1,36,25,10]. We extend this technique by randomizing the token dropping amount and including the case of no dropping to narrow the domain gap between pre-training and transfer.…”
Section: Related Work (mentioning)
confidence: 99%
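A small sketch of the randomized token dropping described in the excerpt above: the drop ratio is sampled per call, and with some probability no tokens are dropped, so pre-training occasionally sees the full token sequence used at transfer time. The function name, probabilities, and shapes are assumptions for illustration, not the cited paper's exact implementation.

```python
# Token dropping with a randomized drop ratio that sometimes keeps all tokens.
import numpy as np

rng = np.random.default_rng(0)

def random_token_drop(tokens, max_drop=0.5, p_no_drop=0.1):
    """Drop a random fraction (0..max_drop) of tokens; with prob p_no_drop, drop none."""
    n = tokens.shape[0]
    if rng.random() < p_no_drop:
        return tokens, np.arange(n)              # no dropping: same input as fine-tuning
    drop_ratio = rng.uniform(0.0, max_drop)      # randomized dropping amount
    keep = np.sort(rng.permutation(n)[: max(1, int(n * (1 - drop_ratio)))])
    return tokens[keep], keep

tokens = rng.random((196, 768)).astype(np.float32)   # e.g. flattened ViT patch tokens
kept_tokens, kept_idx = random_token_drop(tokens)
print(kept_tokens.shape, kept_idx.shape)
```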