VidTr: Video Transformer Without Convolutions

Zhang, Yanyi; Li, Xinyu; Liu, Chunhui; Shuai, Bing; Zhu, Yi; Brattoli, Biagio; Chen, Hao; Marsic, Ivan; Tighe, Joseph

doi:10.1109/iccv48922.2021.01332

Cited by 139 publications

(58 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Patch Tokenization. Most VTs follow ViT [7] and employ a 2D-based patch tokenization [9], [88], [114], [138], dividing the input video frames into regions of fixed size h × w. For instance, 16 × 16 [9], [88], [138] or even multi-scale patch tokenization, ranging from 9 × 5 to 108 × 60 [114]. Others propose using 3D patches instead [11], [12], [48], [49], taking into account the time dimension in small t × h × w regions (e.g., 4 × 16 × 16 [49]).…”

Section: Tokenizationmentioning

confidence: 99%

“…Other works have leveraged full CNN backbone architectures and still manage to train end-to-end, either with a pre-trained backbone [55], [112], [117], or training it from scratch with the Transformer [82], [97], [123], [137]. Some were able to do so thanks to using only a few (from 1 to 4) Transformer layers [42], [76], [138], showing that adding a few Transformer layers right after a large backbone may be enough to boost performance. Some other's success is attributable to leveraging efficient designs, such as local SA [119] or weight sharing [64] -as seen in Sec.…”

Section: Training Regimementioning

confidence: 99%

“…Transformers have also been applied for action anticipation [67], [80], sign-language translation [78], [136], visualquestion answering [55], [138], autonomous driving [218], robot navigation [134], visual-language navigation [66], personality recognition [142], lip reading [137], dynamic scene graph generation [131], and multimedia recomendation [141]. As not many video Transformers have tackled this, it is too early to ascertain specific trends, so we simply list them here for completeness.…”

Section: S29 Other Tasksmentioning

confidence: 99%

See 2 more Smart Citations

Video Transformers: A Survey

Selva¹,

Johansen²,

Escalera³

et al. 2022

Preprint

View full text Add to dashboard Cite

Transformer models have shown great success modeling long-range interactions. Nevertheless, they scale quadratically with input length and lack inductive biases. These limitations can be further exacerbated when dealing with the high dimensionality of video. Proper modeling of video, which can span from seconds to hours, requires handling long-range interactions. This makes Transformers a promising tool for solving video related tasks, but some adaptations are required. While there are previous works that study the advances of Transformers for vision tasks, there is none that focus on in-depth analysis of video-specific designs. In this survey we analyse and summarize the main contributions and trends for adapting Transformers to model video data. Specifically, we delve into how videos are embedded and tokenized, finding a very widspread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens. Furthermore, we study how the Transformer layer has been tweaked to handle longer sequences, generally by reducing the number of tokens in single attention operation. Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches. Finally, we explore how other modalities are integrated with video and conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D CNN counterparts with equivalent FLOPs and no significant parameter increase.

show abstract

Section: Tokenizationmentioning

confidence: 99%

Section: Training Regimementioning

confidence: 99%

Section: S29 Other Tasksmentioning

confidence: 99%

See 1 more Smart Citation

Video Transformers: A Survey

Selva¹,

Johansen²,

Escalera³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Following this trend, a high number of models has been proposed, which are designed to achieve high accuracy [44], resource-efficiency [45], [46] or are tailored for specific tasks, such as object detection [47] or semantic segmentation [22]. While multiple works leverage visual transformer models in conventional video-based human activity classification [19], [48], [49], or standard sequence transformers for skeleton encodings [28], [50], [51], their potential of visual transformers as data-efficient encoders of body movement cast as images has not been considered yet and is the main motivation of our work. Furthermore, inspired by the recent success of feature augmentation method in semi-supervised learning [24], we for the first time propose training a visual transformer with an additional auxiliary branch for augmenting the embeddings using category-specific prototypes and self-attention.…”

Section: Related Workmentioning

confidence: 99%

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

Peng¹,

Roitberg²,

Yang³

et al. 2022

Preprint

View full text Add to dashboard Cite

Automatically understanding human behaviour allows household robots to identify the most critical needs and plan how to assist the human according to the current situation. However, the majority of such methods are developed under the assumption that a large amount of labelled training examples is available for all concepts-of-interest. Robots, on the other hand, operate in constantly changing unstructured environments, and need to adapt to novel action categories from very few samples. Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays and then used as input to convolutional neural networks. We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement. In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning. Inspired by recent success of feature enhancement methods in semi-supervised learning, we further introduce PROFORMER -an improved training strategy which uses soft-attention applied on iteratively estimated action category PROtotypes used to augment the embeddings and compute an auxiliary consistency loss. Extensive experiments consistently demonstrate the effectiveness of our approach for one-shot recognition from body poses, achieving state-of-the-art results on multiple datasets and surpassing the best published approach on the challenging NTU-120 one-shot benchmark by 1.84%. Our code will be made publicly available at https: //github.com/KPeng9510/ProFormer.

show abstract

“…Token adoptation in vision tasks: At the moment, tokenbased models are widely applied in almost all domains in vision, including classification [22,46,65], object detection [6,16,90], segmentation [23,74], image generation [5,20,24,38,43], video understanding [1,2,4,9,25,28,41,45,47,49,56,85], dense prediction [54,75], point clouds processing [30,88], reinforcement learning [10,37] and tracking [60].…”

Section: Related Workmentioning

confidence: 99%

SWAT: Spatial Structure Within and Among Tokens

Kahatapitiya¹,

Ryoo²

2021

Preprint

View full text Add to dashboard Cite

Modeling visual data as tokens (i.e., image patches), and applying attention mechanisms or feed-forward networks on top of them has shown to be highly effective in recent years. The common pipeline in such approaches includes a tokenization method, followed by a set of layers/blocks for information mixing, both within tokens and among tokens. In common practice, image patches are flattened when converted into tokens, discarding the spatial structure within each patch. Next, a module such as multi-head self-attention captures the pairwise relations among the tokens and mixes them. In this paper, we argue that models can have significant gains when spatial structure is preserved in tokenization, and is explicitly used in the mixing stage. We propose two key contributions: (1) Structure-aware Tokenization and, (2) Structure-aware Mixing, both of which can be combined with existing models with minimal effort. We introduce a family of models (SWAT), showing improvements over the likes of DeiT, MLP-Mixer and Swin Transformer, across multiple benchmarks including ImageNet classification and ADE20K segmentation. Our code and models will be released online.

show abstract

VidTr: Video Transformer Without Convolutions

Cited by 139 publications

References 29 publications

Video Transformers: A Survey

Video Transformers: A Survey

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

SWAT: Spatial Structure Within and Among Tokens

Contact Info

Product

Resources

About