2021
DOI: 10.48550/arxiv.2103.15691
Preprint

ViViT: A Video Vision Transformer

Cited by 106 publications (256 citation statements)
References 0 publications
“…Recently, Transformer-based models [38,67,83,90] have achieved promising performance in various vision tasks, such as image recognition [6,14,21,39,50-52,75,90] and image restoration [11,40,89]. Some methods have tried to use Transformers for video modelling by extending the attention mechanism to the temporal dimension [2,3,38,53,60]. However, most of them are designed for visual recognition, which is fundamentally different from restoration tasks.…”
Section: Vision Transformer (mentioning, confidence: 99%)
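This statement describes extending self-attention to the temporal dimension for video. Below is a minimal sketch of one common factorisation of that idea (spatial attention within each frame, then temporal attention across frames); the class name, sizes, and tensor layout are illustrative assumptions, not the exact ViViT design.

# A minimal sketch (not the exact ViViT architecture) of factorised
# space-time attention: attend over patches within each frame, then
# over frames at each spatial location.
import torch
import torch.nn as nn

class FactorisedSpaceTimeAttention(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_patches, dim) token grid for a video clip.
        b, t, n, d = x.shape
        # Spatial attention: sequences are the patches of one frame.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # Temporal attention: sequences are the frames at one location.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

tokens = torch.randn(2, 8, 196, 768)          # 2 clips, 8 frames, 14x14 patches
out = FactorisedSpaceTimeAttention(768)(tokens)
print(out.shape)                               # torch.Size([2, 8, 196, 768])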
“…The elegance of ViT [23] has also motivated similar model designs with simpler global operators such as MLP-Mixer [85], gMLP [53], GFNet [74], and FNet [43], to name a few. Despite successful applications to many high-level tasks [4,23,56,83,87,100], the efficacy of these global models on low-level enhancement and restoration problems has not been studied extensively. The pioneering works on Transformers for low-level vision [9,14] directly applied full self-attention, which only accepts relatively small patches of fixed sizes (e.g., 48×48).…”
Section: Enhancement (mentioning, confidence: 99%)
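As a sketch of the "simpler global operator" idea this statement contrasts with attention, here is a minimal token-mixing block in the MLP-Mixer style: tokens are mixed by a plain MLP applied across the patch dimension. Hidden sizes and names are assumptions for the sketch, not values from the cited papers.

# A hedged sketch of an MLP-Mixer-style block: a global operator that
# replaces self-attention with an MLP across the patch axis.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):  # hypothetical, illustrative sizes
    def __init__(self, num_patches: int, dim: int,
                 token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Mixes information ACROSS patches (global, like attention).
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        # Mixes information WITHIN each patch, across channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)  # token mixing + residual
        return x + self.channel_mlp(self.norm2(x)) # channel mixing + residual

x = torch.randn(2, 196, 512)
print(MixerBlock(196, 512)(x).shape)               # torch.Size([2, 196, 512])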
“…Apart from the sequence-to-sequence structure, the efficiency of PVT [39] and Swin Transformer [30] has sparked much interest in exploring Hierarchical Vision Transformers (HVT) [14,22,41,44]. ViT has also been extended to low-level tasks and dense prediction problems [2,6,20]. In particular, concurrent semantic segmentation methods driven by ViT present impressive performance.…”
Section: Vision Transformer (mentioning, confidence: 99%)
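The core step behind the hierarchical transformers this statement mentions (PVT, Swin) is downsampling between stages. A minimal sketch of Swin-style 2x2 patch merging is given below; details (bias-free reduction, normalisation placement) follow the Swin design in spirit, but dimensions are illustrative assumptions.

# A minimal sketch of hierarchical downsampling between transformer
# stages: concatenate each 2x2 window of tokens, then project, halving
# spatial resolution while doubling channel width.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim); height and width must be even.
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 window
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4*dim)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2*dim)

x = torch.randn(2, 56, 56, 96)                   # stage-1 feature map
print(PatchMerging(96)(x).shape)                 # torch.Size([2, 28, 28, 192])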