Video Swin Transformer

Liu, Ze; Ning, Jia; Cao, Yue; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Hu, Han

doi:10.1109/cvpr52688.2022.00320

Cited by 542 publications

(130 citation statements)

References 23 publications

Supporting

Mentioning

125

Contrasting

“…However, CNN has a limited receptive field and cannot effectively capture long-range dependency. Recent works have extended Vision Transformer [13] for video representation and demonstrated the benefit of long-range temporal learning [5,33]. To reduce the computational cost, TimeSformer [5] introduces a factorized space-time attention, while Video Swin-Transformer [32] restricts self-attention in a local 3D window.…”

Section: Video Representation (mentioning)

confidence: 99%
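
The cost argument in the statement above can be made concrete with a little arithmetic. The following is a minimal sketch, not code from any of the cited papers: it counts query-key pairs for global self-attention and for attention confined to non-overlapping 3D windows. The function names, the token grid `(16, 56, 56)`, and the window size `(8, 7, 7)` are illustrative assumptions.

```python
from math import prod


def global_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    n = prod(grid)  # total number of tokens in the (T, H, W) grid
    return n * n


def windowed_attention_pairs(grid, window):
    """Query-key pairs when attention stays inside non-overlapping 3D windows."""
    if any(g % w for g, w in zip(grid, window)):
        raise ValueError("this sketch assumes the grid is divisible by the window")
    num_windows = prod(g // w for g, w in zip(grid, window))
    tokens_per_window = prod(window)
    return num_windows * tokens_per_window ** 2


# Assumed example sizes (hypothetical, chosen for illustration only).
grid = (16, 56, 56)   # tokens along time, height, width
window = (8, 7, 7)    # window extent along time, height, width

ratio = global_attention_pairs(grid) // windowed_attention_pairs(grid, window)
print(ratio)  # 128
```

Under these assumed sizes, windowed attention evaluates 128 times fewer query-key pairs than global attention, because the cost grows with the window volume rather than with the full token count.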

“…Although long-form video-language joint learning has been explored in downstream tasks [16,27,28,30,58,60,62], they either use pre-extracted video features which lead to the sub-optimal problem, or utilize image encoder to extract frame features that fail to model the long-range dependency in long-form videos. Recent works [3,5,33] have shown that a video Transformer [48] backbone helps to capture long-range dependency in an end-to-end fashion. An intuitive way for long-form video-language pre-training is to adopt a video Transformer based short-form video-language pretraining model [3,54] with long-form data.…”

Section: Introduction (mentioning)

confidence: 99%

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Sun, Hou, Song et al. 2022

Preprint

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representation from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. We first propose a Multimodal Temporal Contrastive (MTC) loss to learn the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependency while reducing computational cost in Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performances. Specifically, our model achieves a 16.1% relative improvement on the ActivityNet paragraph-to-video retrieval task and 2.4% on the How2QA task. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.

* This work was performed when Yuchong Sun and Hongwei Xue were visiting Microsoft Research as research interns.

† Ruihua Song and Bei Liu are the corresponding authors.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

“…Spacetime attention in video transformers. With the advances of the Vision Transformer [25] as a new way to extract image embeddings, many 'spatial-temporal transformer' architectures have been developed in the video domain [26][27][28]. These works explore and propose interesting solutions for how to organize spatial attention and temporal attention with either coupled (series) [28] and factorized (parallel) attention blocks [26], as well as how to create better tokens for videos by creating three-dimensional spatio-temporal 'tubes' as the tubelet tokenizations [26].…”

Section: Transformers (mentioning)

confidence: 99%
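
The tubelet tokenization mentioned in the statement above can be illustrated with a short sketch. This is not code from the cited works: it only shows how a video of shape (frames, height, width, channels) maps onto a grid of non-overlapping three-dimensional tubelets. The function names, the clip shape `(32, 224, 224, 3)`, and the tubelet size `(2, 16, 16)` are assumptions chosen for illustration.

```python
def tubelet_grid(video_shape, tubelet):
    """Return (token grid, raw values per token) for a (T, H, W, C) video."""
    t, h, w, c = video_shape
    dt, dh, dw = tubelet
    if t % dt or h % dh or w % dw:
        raise ValueError("this sketch assumes the video is divisible by the tubelet")
    grid = (t // dt, h // dh, w // dw)      # tokens along time, height, width
    values_per_token = dt * dh * dw * c     # raw values flattened into one token
    return grid, values_per_token


def token_of(frame, row, col, tubelet):
    """Grid coordinates of the tubelet that contains pixel (frame, row, col)."""
    dt, dh, dw = tubelet
    return (frame // dt, row // dh, col // dw)


# Assumed example sizes (hypothetical, for illustration only).
grid, values = tubelet_grid((32, 224, 224, 3), (2, 16, 16))
print(grid, values)                        # (16, 14, 14) 1536
print(token_of(5, 100, 31, (2, 16, 16)))   # (2, 6, 1)
```

Each tubelet spans several frames, so a single token already carries short-range temporal information before any attention is applied.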

Seeing the forest and the tree: Building representations of both individual and collective dynamics with transformers

Liu, Azabou, Dabagia et al. 2022

Preprint

Complex time-varying systems are often studied by abstracting away from the dynamics of individual components to build a model of the population-level dynamics from the start. However, when building a population-level description, it can be easy to lose sight of each individual and how each contributes to the larger picture. In this paper, we present a novel transformer architecture for learning from time-varying data that build descriptions of both the individual as well as the collective population dynamics. Rather than combining all of our data into our model at the onset, we develop a separable architecture that operates on individual time-series first before passing them forward; this induces a permutation-invariance property and can be used to transfer across systems of different size and order. After demonstrating that our model can be applied to successfully recover complex interactions and dynamics in many-body systems, we apply our approach to populations of neurons in the nervous system. On neural activity datasets, we show that our multi-scale transformer not only yields robust decoding performance, but also provides impressive performance in transfer. Our results show that it is possible to learn from neurons in one animal's brain and transfer the model on neurons in a different animal's brain, with interpretable neuron correspondence across sets and animals. This finding opens up a new path to decode from and represent large collections of neurons.

“…The experimental results show that the algorithm achieves a good tradeoff between speed and performance. Based on the picture classification structure, Video Swin Transformer [42] adds the time dimension, and good results are achieved. ViViT [43] discussed four different ways to realize spatiotemporal attention on the basis of VIT [40].…”

Section: Related Work (mentioning)

confidence: 99%

A Video Classification Method Based on Spatiotemporal Detail Attention and Feature Fusion

Gong 2022

Mobile Information Systems

With the explosive growth of Internet video data, demands for accurate large-scale video classification and management are increasing. In real-world deployment, the balance between effectiveness and timeliness should be fully considered. Generally, the video classification algorithm equipped with time segment network is used in industrial deployment, and the frame extraction feature is used to classify video actions. However, the issue of semantic deviation will be raised due to coarse feature description. In this paper, we propose a novel method, called image dense feature and internal significant detail description, to enhance the generalization and discrimination of feature description. Specifically, the location information layer of space-time geometric relationship is added to effectively engrave the local features of convolution layer. Moreover, the multimodal feature graph network is introduced to effectively improve the generalization ability of feature fusion. Extensive experiments show that the proposed method can effectively improve the results on two commonly used benchmarks (Kinetics-400 and Kinetics-600).
