Video Transformers: A Survey
2023
DOI: 10.1109/tpami.2023.3243465
Cited by 43 publications (18 citation statements); references 141 publications.
“…However, limited training data and model complexity remained among the primary factors constraining model performance. Transformers have also been used for tasks beyond NLP, such as image and video processing [95], and they remain an active area of research in the deep learning community.…”
Section: Introduction (mentioning, confidence: 99%)
“…Interpretable spatio-temporal attention [48] used spatial and temporal attention via ConvLSTM. Recent self-attention mechanisms have also been introduced in STA-TSN [49] and GTA [50], as well as in Transformer-based video models [3]. Although some of these methods do not aim at visual explanation, the blurry-map issue remains for videos because temporal modeling, while useful for classification, can hinder the capture of sharp spatial attention maps.…”
Section: Related Work (mentioning, confidence: 99%)
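
To make the "Transformer-based video models" mentioned in the statement above concrete, here is a minimal sketch of divided space-time self-attention, a factorization common in such models: each spatial location first attends over time, then each frame's tokens attend within the frame. This is not code from the survey or the cited works; the class name, tensor shapes, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions.

```python
# Minimal sketch of divided space-time attention (illustrative, not from
# the surveyed paper). Tokens are arranged as (batch, time, patches, dim).
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal self-attention across frames, then spatial self-attention
    within each frame, each with a residual connection."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        # Temporal attention: each spatial location attends over the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3) + x
        # Spatial attention: each frame's N tokens attend within the frame.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial(xs, xs, xs)
        return xs.reshape(b, t, n, d) + x

# Example: 2 videos, 8 frames, 196 patch tokens of width 384.
tokens = torch.randn(2, 8, 196, 384)
out = DividedSpaceTimeAttention(384)(tokens)
print(out.shape)  # torch.Size([2, 8, 196, 384])
```

The factorization keeps attention cost linear in the number of frames times patches rather than quadratic in their product, which is one reason several video transformers adopt it.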
“…In multimodal learning, models process and integrate data from multiple modalities [5,6,45], with applications in visual and language learning [43], video understanding [46,47], and natural language understanding [29,30,35]. However, expensive human annotations are often required for effective training.…”
Section: Self-Supervised Multimodal Learning (mentioning, confidence: 99%)