2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00675

Multiscale Vision Transformers

Cited by 756 publications (391 citation statements); references 45 publications.
“…and video-text [65,64,87,26,54,1,5], and video-audio [42,53,29] representation learning. While the use of transformer architectures for video is still in its infancy, concurrent works [7,2,51,22] have already demonstrated that this is a highly promising direction. However, these approaches do not have a mechanism for reasoning about motion paths, treating time as just another dimension, unlike our approach.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
“…As in existing video transformer models [7,2], we pre-process the video into a sequence of spatio-temporal (ST) tokens $x_{st} \in \mathbb{R}^D$, for a spatial resolution of $S$ and a temporal resolution of $T$. We use a cuboid embedding [2,22], where disjoint spatio-temporal cubes from the input volume are linearly projected to $\mathbb{R}^D$ (equivalent to a 3D convolution with downsampling). We also test an embedding of disjoint image patches [20].…”
Section: Trajectory Attention for Video Data (citation type: mentioning; confidence: 99%)
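The cuboid embedding the excerpt describes reduces to a strided 3D convolution whose kernel size equals its stride. Below is a minimal PyTorch sketch of that idea; the class name, cube size (2x16x16), and embedding dimension are illustrative assumptions, not values taken from the cited papers.

```python
import torch
import torch.nn as nn

class CuboidEmbedding(nn.Module):
    # Disjoint spatio-temporal cubes linearly projected to R^D:
    # kernel_size == stride makes the 3D convolution non-overlapping,
    # so each cube receives exactly one linear projection.
    def __init__(self, in_channels=3, embed_dim=768, cube=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=cube, stride=cube)

    def forward(self, video):
        # video: (B, C, T, H, W) -> conv -> (B, D, T', H', W')
        x = self.proj(video)
        # Flatten the spatio-temporal grid into a token sequence: (B, N, D)
        return x.flatten(2).transpose(1, 2)

# A 16-frame 224x224 clip yields (16/2) * (224/16)^2 = 1568 tokens of dim 768.
tokens = CuboidEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 768])
```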
“…Action Recognition with Transformer. Following the vision transformer (ViT) [13], which demonstrated competitive performance against CNN models on image classification, many recent works attempt to extend the vision transformer to action recognition [36,25,3,1,14]. VTN [36], VidTr [25], TimeSformer [3] and ViViT [1] share the same concept: each inserts a temporal modeling module into the existing ViT to enhance features along the temporal dimension.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
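The shared design the excerpt describes, a temporal module added on top of per-frame ViT features, can be sketched as self-attention over the time axis alone. The following PyTorch illustration is a hedged sketch of that general pattern; the module name, residual placement, and sizes are assumptions and do not reproduce the exact architecture of VTN, VidTr, TimeSformer, or ViViT.

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    # Self-attention applied only along the time axis, with a residual
    # connection; spatial modeling is left to the underlying image ViT.
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D), one ViT feature vector per frame
        x = self.norm(frame_feats)
        out, _ = self.attn(x, x, x)   # attend across the T frames only
        return frame_feats + out      # residual: enhance, don't replace

# 8 frames of 768-d per-frame ViT features -> temporally enhanced features.
feats = TemporalModule()(torch.randn(2, 8, 768))
print(feats.shape)  # torch.Size([2, 8, 768])
```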