2021
DOI: 10.48550/arxiv.2106.05392
Preprint

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Abstract: In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame t may be entirely unrelated to what is found at that location in frame t + k. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers, trajectory attention, that aggregates information along implicit…
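The abstract describes trajectory attention only at a high level: probabilities are normalized over space within each frame, so a query effectively follows the most similar location through time, and the resulting per-frame trajectory tokens are then pooled along the time axis. The snippet below is a minimal single-head sketch of that idea under assumed conventions (tokens laid out as batch × time·space × dim; the function name `trajectory_attention` and the simple dot-product temporal pooling are illustrative, not the authors' released implementation).

```python
import torch


def trajectory_attention(q, k, v, T, S):
    """Minimal sketch of trajectory attention (single head, assumed layout).

    q, k, v: (B, T*S, D) tokens for a video with T frames of S spatial patches.
    Step 1: for every query location, normalize attention over space within
            each frame separately, yielding one "trajectory token" per frame.
    Step 2: attend along the trajectory (the time axis) to aggregate them.
    """
    B, N, D = q.shape
    scale = D ** -0.5

    # Attention logits from every query to every frame's patches: (B, N, T, S)
    logits = torch.einsum('bnd,bmd->bnm', q * scale, k).view(B, N, T, S)

    # Softmax over space *within each frame*, so the probabilities trace an
    # implicit trajectory of the queried point through the video.
    probs = logits.softmax(dim=-1)

    # Trajectory tokens: one aggregated value per (query, frame) pair.
    v_frames = v.view(B, T, S, D)
    traj = torch.einsum('bnts,btsd->bntd', probs, v_frames)  # (B, N, T, D)

    # Step 2: pool along the trajectory; a simple dot-product temporal
    # attention with the original query is used here as an approximation.
    t_logits = torch.einsum('bnd,bntd->bnt', q * scale, traj)
    t_probs = t_logits.softmax(dim=-1)
    return torch.einsum('bnt,bntd->bnd', t_probs, traj)  # (B, N, D)


# Toy usage: 2 frames of 4 patches, 8-dim tokens.
x = torch.randn(1, 2 * 4, 8)
y = trajectory_attention(x, x, x, T=2, S=4)
print(y.shape)  # torch.Size([1, 8, 8])
```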

Cited by 5 publications (8 citation statements)
References 37 publications
“…Nevertheless, it is observed that the attention process should be performed on each frame independently, so that the model attends to locations following the movement of instances through the video. This observation aligns with the conclusion drawn in action recognition [17,28], in which the 1D time domain and the 2D space domain have different characteristics and should be handled in different fashions.…”
Section: Introduction (supporting)
confidence: 89%
“…Due to the local nature of 2D and 3D convolutions, most of these models typically operate on short video clips of a few seconds. Inspired by the success of Transformer models in natural language processing (NLP), transformer-based models have recently been used successfully for video recognition tasks [5,2,35,15,38]. However, due to the quadratic cost of the self-attention operation, these models are very computationally costly and, thus, only applied to short-range video segments.…”
Section: Related Work (mentioning)
confidence: 99%
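The quadratic cost mentioned in the statement above can be made concrete with a back-of-the-envelope count of query-key pairs; the frame and patch counts below are assumed for illustration only, and the divided space-time variant is included purely as a common point of comparison, not as the cited papers' exact design.

```python
# Rough count of query-key pairs in video self-attention (illustrative numbers).
T, S = 16, 196                        # e.g. 16 frames of 14x14 patches
pairs_joint = (T * S) ** 2            # full joint space-time self-attention
pairs_divided = T * S * (S + T)       # space-only + time-only attention per token
print(f"joint: {pairs_joint:,}  divided: {pairs_divided:,}  "
      f"ratio: {pairs_joint / pairs_divided:.1f}x")
```

Because the joint cost grows with (T·S)², doubling the number of input frames roughly quadruples the attention cost, which is why these models are restricted to short clips.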
“…The majority of modern video recognition models [8,16,41,17,46,25,5,2,35,15,38] are unfortunately not equipped to solve these tasks as they are designed for short-range videos (e.g., 5-10 seconds in duration). Furthermore, extending these models to the long-range video setting by simply adding more input video frames is impractical due to excessive computational cost and GPU memory consumption.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, transformers have been applied to model spatiotemporal dependencies for video recognition [1,3,4,13,38,40,41,62,63], owing to their strength in capturing long-range dependencies [9,12,36,48]. With pretraining on a large-scale image dataset, video transformers achieve the best reported accuracy on video benchmarks [1,3,38], such as Kinetics-400/600 [28].…”
Section: Transformer-based Video Recognition (mentioning)
confidence: 99%