TSM: Temporal Shift Module for Efficient Video Understanding

Lin, Ji; Gan, Chuang; Han, Song

doi:10.1109/iccv.2019.00718

Cited by 1,589 publications

(1,369 citation statements)

References 59 publications

Supporting

Mentioning

1,273

Contrasting

Unclassified


“…As shown in Table 2 Other SOTA Methods. We also compare our method with other recently proposed approaches, including TSM [20], STM [14], and ABM [40]. As Table 2 shows, STH with 8-frame input already outperforms ABM with (16 × 3)-frame input.…”

Section: Comparison With Different Convolutions On Something-somethin (mentioning)

confidence: 94%

“…TRN [38] can learn temporal reasoning relationship from the features of the last layer, but its performance is still inferior to ours. The recently-proposed TSM [20] method performs better than other 2D based methods including TRN [38] and MFNet [18] as it has stronger temporal modeling ability across all levels. Compared to TSM, our proposed STH network achieves new state-of-the-art performance with 46.8% top-1 accuracy at T = 8 and 48.3% top-1 accuracy at T = 16, with even lower computational complexity.…”

Section: Comparison With Different Convolutions On Something-somethin (mentioning)

confidence: 97%

“…Compared to the 2D-Conv models including TSN [33] and TRN [38], our approach shows clear performance improvement on both datasets. For Something-Something V2 dataset, our model outperforms the current state-of-the-art efficient method TSM [20] with fewer FLOPs and parameters, using the same sampling strategy. In addition, our method sampling with only 8 frames performs better than the latest model ABM [40].…”

Section: Comparison On Other Datasets (mentioning)

confidence: 97%

“…However, it sacrifices the temporal modeling capability at some 2D layers. The latest method TSM (Temporal Shift Module) [20] shift part of the channels along the temporal dimension for efficient temporal modeling. STM (Spatio-Temporal and Motion Encoding) [14] designs two module, specifically, a channel-wise spatio-temporal module to encode spatiotemporal feature and a channel-wise motion module to learn motion feature.…”

Section: Related Work (mentioning)

confidence: 99%
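
The citation statement above describes TSM as shifting part of the channels along the temporal dimension. A minimal sketch of that idea in NumPy follows. The function name `temporal_shift`, the `shift_div` parameter, the two-direction split, and the zero padding at the clip boundaries are assumptions made for illustration; the quoted text only says that part of the channels are shifted.

```python
import numpy as np


def temporal_shift(x: np.ndarray, shift_div: int = 8) -> np.ndarray:
    """Shift a fraction of the channels along the time axis.

    x has shape (T, C, H, W): frames, channels, height, width.
    shift_div is a hypothetical knob: C // shift_div channels take their
    values from the next frame, another C // shift_div channels take their
    values from the previous frame, and the rest stay in place.
    """
    num_channels = x.shape[1]
    fold = num_channels // shift_div
    out = np.zeros_like(x)

    # Group 1: frame t receives the features of frame t + 1.
    # The last frame has no successor, so it stays zero (assumed padding).
    out[:-1, :fold] = x[1:, :fold]

    # Group 2: frame t receives the features of frame t - 1.
    # The first frame has no predecessor, so it stays zero.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]

    # Remaining channels are copied unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


if __name__ == "__main__":
    clip = np.random.rand(8, 64, 7, 7)  # 8 frames, 64 channels
    shifted = temporal_shift(clip)
    print(shifted.shape)
```

The shift itself only moves data and adds no multiplications, so a 2D convolution applied after it can mix information from neighbouring frames at no extra arithmetic cost. This is one way to read the "efficient temporal modeling" claim in the quote.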

“…Generally speaking, spatial information focus on static appearance features such as actors and objects from a video, while temporal information can be regarded as an indicator for recognizing motion and action. State-of-the-art approaches leverage both spatial and temporal information to enhance the performance of action recognition [22,20,2,29,6,5]. Temporal modeling is of key importance for recognizing actions.…”

Section: Introduction (mentioning)

confidence: 99%


A Spatio-temporal Hybrid Network for Action Recognition

Zhao

2019

2019 IEEE Visual Communications and Image Processing (VCIP)


Effective and efficient spatio-temporal modeling is essential for action recognition. Existing methods suffer from the trade-off between model performance and model complexity. In this paper, we present a novel Spatio-Temporal Hybrid Convolution Network (denoted as "STH") which simultaneously encodes spatial and temporal video information with a small parameter cost. Different from existing works that sequentially or parallelly extract spatial and temporal information with different convolutional layers, we divide the input channels into multiple groups and interleave the spatial and temporal operations in one convolutional layer, which deeply incorporates spatial and temporal clues. Such a design enables efficient spatio-temporal modeling and maintains a small model scale. STH-Conv is a general building block, which can be plugged into existing 2D CNN architectures such as ResNet and MobileNet by replacing the conventional 2D-Conv blocks (2D convolutions). STH network achieves competitive or even better performance than its competitors on benchmark datasets such as Something-Something (V1 & V2), Jester, and HMDB-51. Moreover, STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.


Tiny Video Networks

2022


Automatic video understanding is becoming more important for applications where real‐time performance is crucial and compute is limited: for example, automated video tagging, robot perception, activity recognition for mobile devices. Yet, accurate solutions so far have been computationally intensive. We propose efficient models for videos—Tiny Video Networks—which are video architectures, automatically designed to comply with fast runtimes and, at the same time are effective at video recognition tasks. The TVNs run at faster‐than‐real‐time speeds and demonstrate strong performance across several video benchmarks. These models not only provide new tools for real‐time video applications, but also enable fast research and development in video understanding. Code and models are available.


Human‐centered attention‐aware networks for action recognition

Liu

2022

Int J of Intelligent Sys


Action recognition in video is a research hot spot in the field of computer vision. Learning important clues in video context has significant effect to promote the interaction prediction and gesture recognition. Most existing methods infer the interactions between actor and context through relational reasoning methods. While these relational features contribute to improve the salience of action performance, the error will occur when the salient region is irrelevant to the recognized action. Therefore, this paper establishes a human‐centered attention mechanism that dynamically highlights regions associated with action recognition according to target appearance to selectively recognize the human‐object interaction action. The effectiveness of the proposed mechanism is verified on the AVA2.2 data set, and the visualized attention map further shows that the proposed attention mechanism can effectively recognize human‐centered strongly correlated action.
