“…The strong capability of modeling long-range relations has facilitated the success of Transformers in various vision tasks, including image classification [27,56,54], object detection [10,88,20], semantic/instance segmentation [76], video understanding [7,2,28,51], point cloud modeling [85,35], 3D object recognition [18], and even low-level processing [16,53,74]. Furthermore, Transformers have advanced vision recognition performance through large-scale pretraining [19,60,12,30,37,68,64]. In this situation, given pre-trained Transformer models, which are much larger than the previously prevalent CNN backbones, one open question is how to fine-tune these big vision models so that they can be adapted to more specific downstream tasks.…”