2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01332
VidTr: Video Transformer Without Convolutions

Cited by 139 publications
(58 citation statements)
References 29 publications
“…Patch Tokenization. Most VTs follow ViT [7] and employ a 2D-based patch tokenization [9], [88], [114], [138], dividing the input video frames into regions of fixed size h × w. For instance, 16 × 16 [9], [88], [138] or even multi-scale patch tokenization, ranging from 9 × 5 to 108 × 60 [114]. Others propose using 3D patches instead [11], [12], [48], [49], taking into account the time dimension in small t × h × w regions (e.g., 4 × 16 × 16 [49]).…”
Section: Tokenization
confidence: 99%
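The contrast this excerpt draws between 2D (per-frame h × w) and 3D (t × h × w) patch tokenization can be made concrete with a short sketch. The module names, the 16 × 16 and 4 × 16 × 16 patch sizes, and the embedding dimension below are illustrative assumptions, not the exact configuration of any single cited model.

```python
import torch
import torch.nn as nn


class Patch2DTokenizer(nn.Module):
    """Per-frame 2D patches of size patch x patch (e.g. 16 x 16)."""

    def __init__(self, in_ch=3, embed_dim=768, patch=16):
        super().__init__()
        # A strided conv is equivalent to "split into patches + linear projection".
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, video):                    # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        x = self.proj(video.flatten(0, 1))       # (B*T, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (B*T, N, D): N tokens per frame
        return x.reshape(b, t * x.shape[1], -1)  # (B, T*N, D)


class Patch3DTokenizer(nn.Module):
    """Spatio-temporal 3D patches of size t x h x w (e.g. 4 x 16 x 16)."""

    def __init__(self, in_ch=3, embed_dim=768, patch=(4, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, video):                    # video: (B, C, T, H, W)
        x = self.proj(video)                     # (B, D, T/4, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (B, N, D): spatio-temporal tokens


video = torch.randn(2, 8, 3, 224, 224)                         # 8-frame clip, (B, T, C, H, W)
print(Patch2DTokenizer()(video).shape)                         # torch.Size([2, 1568, 768])
print(Patch3DTokenizer()(video.permute(0, 2, 1, 3, 4)).shape)  # torch.Size([2, 392, 768])
```

In both variants a strided convolution performs "split into patches, then linearly project" in one operation; the 3D variant yields fewer tokens by folding several frames into each spatio-temporal patch.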
“…Other works have leveraged full CNN backbone architectures and still manage to train end-to-end, either with a pre-trained backbone [55], [112], [117] or by training the backbone from scratch together with the Transformer [82], [97], [123], [137]. Some were able to do so by using only a few (from 1 to 4) Transformer layers [42], [76], [138], showing that adding a few Transformer layers right after a large backbone may be enough to boost performance. Others' success is attributable to efficient designs such as local SA [119] or weight sharing [64], as seen in Sec.…”
Section: Training Regime
confidence: 99%
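The "large CNN backbone followed by only a few Transformer layers" recipe mentioned in this excerpt might look roughly like the sketch below. The ResNet-50 backbone, the two encoder layers, and all hyperparameters are assumptions chosen for illustration, not the setup of any specific cited work.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class CNNPlusTransformer(nn.Module):
    def __init__(self, num_classes=400, num_layers=2, d_model=2048, n_heads=8):
        super().__init__()
        # Frame-level feature extractor; pre-trained weights could be loaded here instead.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):                         # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))    # (B*T, 2048, h', w')
        feats = feats.mean(dim=(2, 3)).reshape(b, t, -1)  # one token per frame
        feats = self.temporal(feats)                  # a few self-attention layers over time
        return self.head(feats.mean(dim=1))           # clip-level prediction


clip = torch.randn(2, 8, 3, 224, 224)
print(CNNPlusTransformer()(clip).shape)               # torch.Size([2, 400])
```

The design choice the excerpt highlights is that the convolutional backbone carries most of the capacity, so only a shallow Transformer stack is needed to model temporal relations across frames.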
“…Following this trend, a large number of models have been proposed that are designed for high accuracy [44] or resource efficiency [45], [46], or are tailored to specific tasks such as object detection [47] or semantic segmentation [22]. While multiple works leverage visual transformer models for conventional video-based human activity classification [19], [48], [49], or standard sequence transformers for skeleton encodings [28], [50], [51], the potential of visual transformers as data-efficient encoders of body movement cast as images has not yet been considered, and this is the main motivation of our work. Furthermore, inspired by the recent success of feature augmentation methods in semi-supervised learning [24], we propose for the first time training a visual transformer with an additional auxiliary branch that augments the embeddings using category-specific prototypes and self-attention.…”
Section: Related Work
confidence: 99%
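The auxiliary branch described in the last sentence of this excerpt (augmenting embeddings with category-specific prototypes and self-attention) is only sketched at a high level; the module below is a speculative interpretation of that idea, with the class name, prototype handling, and dimensions all hypothetical rather than taken from the cited paper.

```python
import torch
import torch.nn as nn


class PrototypeAugmenter(nn.Module):
    def __init__(self, num_classes=60, embed_dim=256, n_heads=4):
        super().__init__()
        # One learnable prototype vector per activity category (hypothetical design).
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, embedding):                     # embedding: (B, D)
        query = embedding.unsqueeze(1)                # (B, 1, D)
        protos = self.prototypes.unsqueeze(0).expand(embedding.shape[0], -1, -1)
        augmented, _ = self.attn(query, protos, protos)  # attend over class prototypes
        return embedding + augmented.squeeze(1)       # residual augmentation


feats = torch.randn(8, 256)
print(PrototypeAugmenter()(feats).shape)              # torch.Size([8, 256])
```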
“…Token adoption in vision tasks: At the moment, token-based models are widely applied in almost all vision domains, including classification [22,46,65], object detection [6,16,90], segmentation [23,74], image generation [5,20,24,38,43], video understanding [1,2,4,9,25,28,41,45,47,49,56,85], dense prediction [54,75], point cloud processing [30,88], reinforcement learning [10,37] and tracking [60].…”
Section: Related Work
confidence: 99%