2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00675

Multiscale Vision Transformers

Cited by 756 publications (391 citation statements); references 45 publications.
“…and video-text [65,64,87,26,54,1,5], and video-audio [42,53,29] representation learning. While the use of transformer architectures for video is still in its infancy, concurrent works [7,2,51,22] have already demonstrated that this is a highly promising direction. However, these approaches do not have a mechanism for reasoning about motion paths, treating time as just another dimension, unlike our approach.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
“…As in existing video transformer models [7,2], we pre-process the video into a sequence of spatio-temporal (ST) tokens $x_{st} \in \mathbb{R}^D$, for a spatial resolution of $S$ and a temporal resolution of $T$. We use a cuboid embedding [2,22], where disjoint spatio-temporal cubes from the input volume are linearly projected to $\mathbb{R}^D$ (equivalent to a 3D convolution with downsampling). We also test an embedding of disjoint image patches [20].…”
Section: Trajectory Attention for Video Data (citation type: mentioning; confidence: 99%)
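The cuboid embedding the excerpt describes reduces to a strided 3D convolution whose kernel size equals its stride. Below is a minimal PyTorch sketch of that idea; the class name, cube size (2x16x16), and embedding dimension are illustrative assumptions, not values taken from the cited papers.

```python
import torch
import torch.nn as nn

class CuboidEmbedding(nn.Module):
    # Disjoint spatio-temporal cubes linearly projected to R^D:
    # kernel_size == stride makes the 3D convolution non-overlapping,
    # so each cube receives exactly one linear projection.
    def __init__(self, in_channels=3, embed_dim=768, cube=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=cube, stride=cube)

    def forward(self, video):
        # video: (B, C, T, H, W) -> conv -> (B, D, T', H', W')
        x = self.proj(video)
        # Flatten the spatio-temporal grid into a token sequence: (B, N, D)
        return x.flatten(2).transpose(1, 2)

# A 16-frame 224x224 clip yields (16/2) * (224/16)^2 = 1568 tokens of dim 768.
tokens = CuboidEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 768])
```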
“…Action Recognition with Transformer. Following the vision transformer (ViT) [13], which demonstrated competitive performance against CNN models on image classification, many recent works attempt to extend the vision transformer to action recognition [36,25,3,1,14]. VTN [36], VidTr [25], TimeSformer [3] and ViViT [1] share the same concept: each inserts a temporal modeling module into the existing ViT to enhance features along the temporal dimension.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
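The shared design the excerpt describes, a temporal module added on top of per-frame ViT features, can be sketched as self-attention over the time axis alone. The following PyTorch illustration is a hedged sketch of that general pattern; the module name, residual placement, and sizes are assumptions and do not reproduce the exact architecture of VTN, VidTr, TimeSformer, or ViViT.

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    # Self-attention applied only along the time axis, with a residual
    # connection; spatial modeling is left to the underlying image ViT.
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D), one ViT feature vector per frame
        x = self.norm(frame_feats)
        out, _ = self.attn(x, x, x)   # attend across the T frames only
        return frame_feats + out      # residual: enhance, don't replace

# 8 frames of 768-d per-frame ViT features -> temporally enhanced features.
feats = TemporalModule()(torch.randn(2, 8, 768))
print(feats.shape)  # torch.Size([2, 8, 768])
```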