2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00320

Video Swin Transformer

Cited by 542 publications (130 citation statements) · References 23 publications
“…However, CNN has a limited receptive field and cannot effectively capture long-range dependency. Recent works have extended Vision Transformer [13] for video representation and demonstrated the benefit of long-range temporal learning [5,33]. To reduce the computational cost, TimeSformer [5] introduces a factorized space-time attention, while Video Swin Transformer [32] restricts self-attention to a local 3D window.…”
Section: Video Representation
confidence: 99%
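
The local 3D window mechanism this excerpt refers to can be illustrated with a minimal sketch: the video feature map is partitioned into non-overlapping spatio-temporal windows and standard self-attention is run inside each window. The window size, tensor names, and use of torch.nn.MultiheadAttention below are illustrative assumptions, not the paper's actual implementation (which additionally uses shifted windows and relative position bias).

```python
import torch

def window_partition_3d(x, window_size=(2, 7, 7)):
    """Split a video feature map into non-overlapping 3D windows.

    x: (B, T, H, W, C) features; T, H, W are assumed divisible by the
    window size (the real model pads and shifts windows, omitted here).
    Returns (num_windows * B, Wt*Wh*Ww, C) groups of tokens on which
    self-attention is run independently.
    """
    B, T, H, W, C = x.shape
    wt, wh, ww = window_size
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)

# Attention cost now scales with the window volume (2*7*7 = 98 tokens)
# rather than with the full T*H*W token count of the clip.
feats = torch.randn(1, 8, 14, 14, 96)             # toy feature map
tokens = window_partition_3d(feats)               # (16, 98, 96)
attn = torch.nn.MultiheadAttention(96, num_heads=3, batch_first=True)
out, _ = attn(tokens, tokens, tokens)             # per-window self-attention
```
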
“…Although long-form video-language joint learning has been explored in downstream tasks [16,27,28,30,58,60,62], these works either use pre-extracted video features, which leads to sub-optimal results, or use an image encoder to extract frame features, which fails to model the long-range dependency in long-form videos. Recent works [3,5,33] have shown that a video Transformer [48] backbone helps capture long-range dependency in an end-to-end fashion. An intuitive approach to long-form video-language pre-training is to adopt a video-Transformer-based short-form video-language pre-training model [3,54] and apply it to long-form data.…”
Section: Introduction
confidence: 99%
“…Space-time attention in video transformers. With the advance of the Vision Transformer [25] as a new way to extract image embeddings, many 'spatial-temporal transformer' architectures have been developed in the video domain [26-28]. These works explore how to organize spatial and temporal attention with either coupled (series) [28] or factorized (parallel) [26] attention blocks, as well as how to create better tokens for videos by forming three-dimensional spatio-temporal 'tubes' as tubelet tokenizations [26].…”
Section: Transformers
confidence: 99%
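
The tubelet tokenization mentioned in this excerpt can be sketched as a 3D convolution whose kernel equals its stride, so that each non-overlapping spatio-temporal tube becomes one token. The tube size, embedding dimension, and class name below are illustrative assumptions, not ViViT's exact configuration.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Map a video clip to a sequence of spatio-temporal 'tube' tokens.

    A Conv3d with kernel_size == stride cuts the clip into non-overlapping
    t x h x w tubes and linearly projects each one, extending ViT's 2D
    patch embedding along the time axis.
    """
    def __init__(self, in_channels=3, embed_dim=96, tube_size=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tube_size, stride=tube_size)

    def forward(self, video):
        # video: (B, C, T, H, W) -> (B, embed_dim, T', H', W')
        x = self.proj(video)
        # Flatten the spatio-temporal grid into a token sequence.
        return x.flatten(2).transpose(1, 2)       # (B, T'*H'*W', embed_dim)

clip = torch.randn(1, 3, 16, 224, 224)            # toy 16-frame RGB clip
tokens = TubeletEmbedding()(clip)                 # (1, 8*14*14, 96)
```
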
“…The experimental results show that the algorithm achieves a good trade-off between speed and performance. Building on the image-classification structure, Video Swin Transformer [42] adds the time dimension and achieves good results. ViViT [43] discusses four different ways to realize spatio-temporal attention on the basis of ViT [40].…”
Section: Related Work
confidence: 99%
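
The factorized space-time attention that the excerpts attribute to TimeSformer and ViViT can be sketched as temporal attention over each spatial location followed by spatial attention within each frame. The module below is a minimal illustration; the dimensions, layer choices, and the omission of residual connections and normalization are simplifying assumptions, not either paper's implementation.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    """Divided attention: attend over time per spatial location, then over
    space per frame, instead of joint attention over all T*N tokens."""
    def __init__(self, dim=96, heads=3):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C) with N spatial tokens per frame
        B, T, N, C = x.shape
        # Temporal attention: sequences of length T, one per spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt, _ = self.temporal(xt, xt, xt)
        x = xt.reshape(B, N, T, C).permute(0, 2, 1, 3)
        # Spatial attention: sequences of length N, one per frame.
        xs = x.reshape(B * T, N, C)
        xs, _ = self.spatial(xs, xs, xs)
        return xs.reshape(B, T, N, C)

x = torch.randn(2, 8, 196, 96)                    # toy: 8 frames, 14x14 tokens
y = FactorizedSpaceTimeAttention()(x)             # (2, 8, 196, 96)
```
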