2022
DOI: 10.48550/arxiv.2205.10063
Preprint

Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Abstract: Masked AutoEncoder (MAE) has recently led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design, which significantly improves both pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of the vanilla Vision Transformer (ViT), whose self-attention mechanism reasons over an arbitrary subset of discrete image patches. However, it is still unclear how the advanced Pyramid-based ViTs (e.g., PVT, Swin) can…
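To make the asymmetric design concrete, the following is a minimal PyTorch-style sketch of the idea: only the visible patches pass through the heavy encoder, and a lightweight decoder reconstructs the full patch sequence from the encoded tokens plus mask tokens. The helper names (patchify, encoder, decoder) are hypothetical placeholders rather than the authors' code, and positional embeddings and the learned mask token are simplified away.

    import torch

    def mae_forward(imgs, patchify, encoder, decoder, mask_ratio=0.75):
        # Minimal sketch of MAE's asymmetric encoder-decoder; patchify/encoder/decoder
        # are hypothetical callables, and positional embeddings are omitted for brevity.
        patches = patchify(imgs)                          # (B, N, D) flattened patches
        B, N, D = patches.shape
        len_keep = int(N * (1 - mask_ratio))

        # Randomly shuffle patch indices per sample and keep only a small visible subset.
        noise = torch.rand(B, N, device=patches.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_keep = ids_shuffle[:, :len_keep]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # The encoder's global self-attention can reason over this arbitrary subset.
        latent = encoder(visible)                         # (B, len_keep, E)
        E = latent.shape[-1]

        # Decoder input: encoded visible tokens plus mask tokens (a learned token in the
        # real model; zeros here), restored to the original patch order.
        mask_tokens = torch.zeros(B, N - len_keep, E, device=latent.device)
        full = torch.cat([latent, mask_tokens], dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, E))

        return decoder(full)                              # predicted pixels for all patches

The point the abstract makes is visible here: nothing in this flow assumes a regular patch grid, which is exactly what breaks once the local-window attention of pyramid-based ViTs enters the picture.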

Cited by 11 publications (16 citation statements) · References 33 publications
“…Unfortunately, hViT cannot be directly applied to enable MAE pre-training because the local window attention used in hViT makes it difficult to handle randomly masked patches as in MAE. Recently, Uniform Masking MAE (UM-MAE) [154] was proposed to empower MAE with hViTs; it introduces a two-stage pipeline: sampling and masking. It starts by randomly sampling a portion of patches (25% reported in the paper) from each block, and then masks additional patches on top of the sampled ones.…”
Section: Learning By Reconstruction (mentioning)
confidence: 99%
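As a rough illustration of that two-stage pipeline, the sketch below first uniformly samples one patch from every 2x2 block (the 25% sampling reported in the paper) and then masks an additional fraction of the sampled patches. The secondary_ratio value is an assumed parameter, and because the quoted description does not say how the additionally masked patches are handled, this sketch simply drops them from the returned keep-mask.

    import torch

    def uniform_masking(batch_size, grid_h, grid_w, secondary_ratio=0.25, device="cpu"):
        # Stage 1: sample one patch per 2x2 block (25% of patches, evenly spread so that
        # local-window attention still sees a regular layout).
        # Stage 2: mask an extra fraction of the sampled patches (assumed ratio).
        # Returns a boolean (B, grid_h * grid_w) tensor: True = patch fed to the encoder.
        assert grid_h % 2 == 0 and grid_w % 2 == 0
        B, n_blocks = batch_size, (grid_h // 2) * (grid_w // 2)

        # Pick one of the four positions inside each 2x2 block, independently per sample.
        choice = torch.randint(0, 4, (B, n_blocks), device=device)
        block_rows = torch.arange(grid_h // 2, device=device).repeat_interleave(grid_w // 2)
        block_cols = torch.arange(grid_w // 2, device=device).repeat(grid_h // 2)
        rows = block_rows * 2 + choice // 2
        cols = block_cols * 2 + choice % 2
        sampled = torch.zeros(B, grid_h, grid_w, dtype=torch.bool, device=device)
        sampled[torch.arange(B, device=device)[:, None], rows, cols] = True

        # Secondary masking on top of the sampled patches.
        keep = sampled.flatten(1).clone()
        idx = keep.nonzero(as_tuple=False)                 # (B * n_blocks, 2)
        drop = torch.rand(idx.shape[0], device=device) < secondary_ratio
        keep[idx[drop, 0], idx[drop, 1]] = False
        return keep

For the 14×14 patch grid of a 224×224 input, this keeps 49 sampled patches per image, of which, with the assumed 25% secondary ratio, about 37 on average remain visible to the encoder.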
“…An inevitable bottleneck for the industrial applications of MIM is that these models typically require huge computational resources and long pre-training duration. To this end, some works accelerate the encoding process via the asymmetric encoder-decoder strategy [24,31] or lessening the input patches [8,35]. Only accelerating the encoding process sometimes doesn't really speed up the representation learning, like GreenMIM vs. SimMIM-192 in Fig.…”
Section: Related Work (mentioning)
confidence: 99%
“…Further, GreenMIM [31] extends the asymmetric encoder-decoder strategy to hierarchical vision transformers (e.g., Swin [39]). Besides, [8,22,35] shrink the input resolution to lessen the input patches, thereby reducing the computational burden. However, they all aim to accelerate the encoding process rather than the representation learning.…”
Section: Introduction (mentioning)
confidence: 99%
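For intuition on why shrinking the input resolution lessens the patch count so effectively (the 16-pixel patch size below is a common default, not a value taken from the quoted text): the token count is N = (H / P) × (W / P), and self-attention cost scales roughly with N². With P = 16, a 224 × 224 input gives N = 14 × 14 = 196 tokens, while 112 × 112 gives N = 7 × 7 = 49, i.e. 4× fewer tokens and roughly 16× cheaper attention.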
“…CAE [37] separates the encoder representation from the prediction task and makes predictions in the latent representation space from visible patches to masked patches. UM-MAE [38] successfully uses a quadratic masking strategy to achieve self-supervision in pyramid networks like Swin Transformer [13], PVT [14], etc. ConvMAE [39] presents a simple self-supervised learning framework with a block-wise masking strategy, which demonstrates that multi-scale features from supervised encoders can improve the performance of downstream tasks.…”
Section: Related Work (mentioning)
confidence: 99%
“…The very recent approach Green-MAE [40] is similar to our approach, allowing the hierarchical models to discard masked patches and operate only on the visible ones. Our CoTMAE benefits from the development of hybrid convolutional-transformer pyramid networks and useful experience gained from recent works [34][35][36][37][38][39][40][41][42].…”
Section: Related Work (mentioning)
confidence: 99%