2022
DOI: 10.48550/arxiv.2203.12602
Preprint
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Cited by 32 publications (85 citation statements)
References 0 publications
“…The strong capability of modeling long-range relations has enabled Transformers in various vision tasks, including image classification [27,56,54], object detection [10,88,20], semantic/instance segmentation [76], video understanding [7,2,28,51], point cloud modeling [85,35], 3D object recognition [18], and even low-level processing [16,53,74]. Furthermore, Transformers have advanced vision recognition performance through large-scale pre-training [19,60,12,30,37,68,64]. In this situation, given that the pre-trained Transformer models are larger than the previously prevalent CNN backbones, one open question is how to fine-tune these big vision models so that they can be adapted to more specific downstream tasks.…”
Section: Transformer in Vision
Mentioning confidence: 99%
“…The first one lies in the pre-training stage, which requires algorithms that can learn well-generalized representations that are easy to apply to many tasks. Recent advances in self-supervised learning [11,5,37,87,75,68,29] can serve as a solution to this challenge. The second one, which is our main concern in this work, is to build an effective pipeline that can adapt the model obtained at the pre-training stage to various downstream tasks by tuning as few parameters as possible and keeping the remaining parameters frozen.…”
Section: Introduction
Mentioning confidence: 99%
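The second challenge described in the statement above, adapting a frozen pre-trained model to a downstream task while updating as few parameters as possible, can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the pipeline of any cited paper; the torchvision ViT-B/16 backbone and the two-class head are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import vit_b_16

# The backbone stands in for a pre-trained model; in practice one would
# load a self-supervised checkpoint rather than build an untrained ViT.
backbone = vit_b_16(weights=None)

# Freeze every backbone parameter so only the new head is updated.
for p in backbone.parameters():
    p.requires_grad = False

# Replace the classification head with a small trainable layer for a
# hypothetical 2-class downstream task (ViT-B hidden width is 768).
backbone.heads = nn.Linear(768, 2)

# The optimizer only receives the handful of trainable parameters.
trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

More elaborate variants insert small adapter or prompt modules inside the frozen backbone, but the principle of updating only a small subset of parameters is the same.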
“…Utilizing unlabeled visual data in a self-supervised manner to learn representations is intriguing but challenging. Following BERT [12] in natural language processing, pre-training with masked image modeling (MIM) has shown great success in learning visual representations for various downstream vision tasks [19,3,45,44,39], including image classification [11], object detection [27], semantic segmentation [50], video classification [18], and motor control [44].…”
Section: Introduction
Mentioning confidence: 99%
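The MIM recipe referenced in the statement above amounts to hiding a random subset of patch tokens and training the network to reconstruct them. Below is a minimal sketch of the masking step, assuming patch embeddings have already been computed; the 75% masking ratio is an illustrative default, not a value taken from the cited works.

import torch

def random_patch_mask(patch_tokens, mask_ratio=0.75):
    # patch_tokens: (batch, num_patches, dim) patch embeddings.
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1 - mask_ratio))

    # A random permutation per sample decides which patches stay visible;
    # keeping the indices sorted preserves positional order.
    noise = torch.rand(b, n)
    keep_idx = noise.argsort(dim=1)[:, :num_keep].sort(dim=1).values

    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

    # Boolean mask: True marks hidden patches, False marks visible ones.
    mask = torch.ones(b, n, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)
    return visible, mask

For video, the same idea is applied to spatiotemporal patches (e.g., masking whole tubes across frames), which is the setting the VideoMAE preprint above targets.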
“…Inspired by the success of BERT, the vision community has recently shown great interest in imitating its formulation (i.e., masked autoencoding) for image understanding. A series of works [2,13,49,6,19,42,39,35] has been proposed in recent months, among which the Masked AutoEncoder (MAE) [19] is one of the most representative methods; it significantly improves both pre-training efficiency and fine-tuning accuracy, leading a new trend of SSL across vision tasks.…”
Section: Introduction
Mentioning confidence: 99%
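The efficiency gain credited to MAE in the statement above comes from its asymmetric design: the encoder processes only the visible tokens, a lightweight decoder fills masked positions with a learnable mask token and reconstructs the hidden patches, and the loss is computed on masked positions only. The toy sketch below illustrates that forward pass, reusing the hypothetical random_patch_mask outputs from the previous sketch; the layer sizes and the generic Transformer layers are assumptions, not the actual MAE architecture.

import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    # Toy illustration of the asymmetric masked-autoencoder forward pass.
    def __init__(self, dim=192, patch_dim=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)  # sees visible tokens only
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=1)  # lightweight reconstruction head
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_pixels = nn.Linear(dim, patch_dim)

    def forward(self, visible, mask, targets):
        b, n = mask.shape
        # The encoder runs on the small visible subset, which is where the
        # pre-training speed-up comes from.
        encoded = self.encoder(visible)

        # Rebuild the full token sequence: visible positions get encoded
        # tokens, masked positions get the shared learnable mask token.
        full = self.mask_token.expand(b, n, -1).clone()
        full[~mask] = encoded.reshape(-1, encoded.shape[-1])

        # Decode and regress patch pixels; the loss covers masked patches only.
        pred = self.to_pixels(self.decoder(full))
        return nn.functional.mse_loss(pred[mask], targets[mask])

A real implementation adds positional embeddings and uses full ViT blocks; the sketch only shows where the encoder/decoder asymmetry and the masked-only loss enter.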