2022
DOI: 10.48550/arxiv.2202.03026
Preprint

Context Autoencoder for Self-Supervised Representation Learning

Cited by 40 publications (84 citation statements)
References 0 publications

“…DINO [3], ViT-B/16, 300, 44.1; BEIT [2], ViT-B/16, 800, 45.6; MAE [16], ViT-B/16, 1600, 48.1; CAE [5], ViT-B/16, 800, 48.8; PeCo [12], ViT… It is admitted that the above significant improvements of our MVP could be attributed to the super-large-scale multimodal dataset used in pre-training CLIP. To validate this, we also conduct comparisons with the BEIT model pre-trained on ImageNet-21K, which contains about 21K classes.…”
Section: Methods (mentioning)
confidence: 97%
“…Moreover, with the pixel-level reconstruction of each masked patch, MAE [16] further improved the final results. Concurrently, several similar MIM-based schemes [37,34,5] have been proposed and have pushed forward the development of visual pre-training. In this work, we also adopt the MIM-based framework but design a dedicated multimodality-driven pretext task to guide the visual models to learn more multimodal semantic knowledge.…”
Section: Visual Pre-training (mentioning)
confidence: 99%
“…On the pixel level, MAE [18] and SimMIM simply mask the pixels in the patches and then predict them, encouraging the model to focus on the semantics. Similarly, CiM [14] and CAE [6] pursue the same goal but with more sophisticated structure designs. However, some useful local details may be lost in these methods.…”
Section: Visual Pre-training (mentioning)
confidence: 99%
“…masked image modeling (MIM) [6,12,18,49,52,63] exhibits promising potential; it inherits the "mask-and-reconstruct" idea from masked autoencoding methods in the natural language processing (NLP) field, such as BERT [11]. More concretely, parts of the content in an input image are masked; the encoder learns latent representations from the visible regions, which the decoder then uses to reconstruct the masked content.…”
Section: Introduction (mentioning)
confidence: 99%
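For readers unfamiliar with the "mask-and-reconstruct" pipeline described in the excerpt above, the following is a minimal sketch. The shapes, masking ratio, and the toy linear stand-ins for the transformer encoder and decoder are illustrative assumptions, not the design of CAE or any specific cited paper.

```python
# Minimal sketch of masked image modeling: mask patches, encode the visible
# ones, and compute a reconstruction loss on the masked ones.
# All shapes and the linear "encoder"/"decoder" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=16):
    """Split an HxWxC image into rows of flattened patch*patch*C patches."""
    h, w, c = img.shape
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = rng.random((224, 224, 3)).astype(np.float32)
patches = patchify(img)                              # (196, 768)
num_patches, patch_dim = patches.shape

# Randomly mask a fraction of the patches; only the visible ones are encoded.
mask_ratio = 0.75
perm = rng.permutation(num_patches)
num_visible = int(num_patches * (1 - mask_ratio))
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

# Toy linear maps standing in for the transformer encoder and decoder.
dim = 128
W_enc = rng.normal(scale=0.02, size=(patch_dim, dim))
W_dec = rng.normal(scale=0.02, size=(dim, patch_dim))

latent = patches[visible_idx] @ W_enc                  # encode visible regions only
context = latent.mean(axis=0)                          # crude stand-in for the decoder context
pred = np.tile(context @ W_dec, (len(masked_idx), 1))  # predict the masked patches

# The reconstruction loss is computed on the masked patches.
loss = np.mean((pred - patches[masked_idx]) ** 2)
print(f"masked-patch reconstruction MSE: {loss:.4f}")
```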
“…The method of DABS [63] also uses the idea of patch misplacement, but it does not have a way to handle degenerate learning and does not show performance improvements. A technique that has proven very beneficial for improving the training efficiency of vision transformers is token dropping [1,36,25,10]. We extend this technique by randomizing the token dropping amount and including the case of no dropping to narrow the domain gap between pre-training and transfer.…”
Section: Related Work (mentioning)
confidence: 99%
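A small sketch of the randomized token dropping described in the excerpt above: the drop ratio is sampled per call, and with some probability no tokens are dropped, so pre-training occasionally sees the full token sequence used at transfer time. The function name, probabilities, and shapes are assumptions for illustration, not the cited paper's exact implementation.

```python
# Token dropping with a randomized drop ratio that sometimes keeps all tokens.
import numpy as np

rng = np.random.default_rng(0)

def random_token_drop(tokens, max_drop=0.5, p_no_drop=0.1):
    """Drop a random fraction (0..max_drop) of tokens; with prob p_no_drop, drop none."""
    n = tokens.shape[0]
    if rng.random() < p_no_drop:
        return tokens, np.arange(n)              # no dropping: same input as fine-tuning
    drop_ratio = rng.uniform(0.0, max_drop)      # randomized dropping amount
    keep = np.sort(rng.permutation(n)[: max(1, int(n * (1 - drop_ratio)))])
    return tokens[keep], keep

tokens = rng.random((196, 768)).astype(np.float32)   # e.g. flattened ViT patch tokens
kept_tokens, kept_idx = random_token_drop(tokens)
print(kept_tokens.shape, kept_idx.shape)
```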