STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

Yang, Zhaoqilin; An, Gaoyun; Zhang, Ruichen

doi:10.3390/math10183290

Cited by 6 publications

(2 citation statements)

References 44 publications

“…Lin et al [27] proposed a time shift module for hardware efficient video recognition, which moves part of the channel along the time dimension to exchange information with adjacent frames. Yang et al [28] proposed a spatial-temporal displacement module for efficient video recognition. This module moves some channels in the time dimension and space dimension of different channels, enabling the network to learn its spatial-temporal characteristics.…”

Section: B Time Module Plugin (mentioning)

confidence: 99%

Temporal superimposed crossover module for effective continuous sign language

Zhu, Li, Yuan et al. 2022

Preprint

The ultimate goal of continuous sign language recognition (CSLR) is to facilitate communication between sign language users and non-users, which requires the model to run in real time and to be easy to deploy. However, previous research on CSLR has paid little attention to real-time performance and deployability. To improve both, this paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution, which gives 2D convolution strong spatial-temporal modelling capability with no added parameters and a lower deployment cost than other spatial-temporal convolutions. The overall CSLR model based on TSCM is built on the improved ResBlockT network introduced in this paper. The "TSCM+2D convolution" hybrid convolution is applied to the ResBlock of the ResNet network to form the new ResBlockT, and random gradient stop and multilevel CTC loss are introduced to train the model, which reduces the final recognition WER while also reducing training memory usage, and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in CSLR to extract the temporal-spatial features of sign language video using only 2D convolution for end-to-end recognition. Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method, which achieves highly competitive results.
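
The channel shifting that the quoted statement attributes to Lin et al. [27] can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `temporal_shift`, the `fold_div` parameter, the (T, C, H, W) layout, and the choice to shift one eighth of the channels in each direction with zero padding are all assumptions made for illustration.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    x        : array of shape (T, C, H, W), one feature map per frame.
    fold_div : hypothetical parameter; C // fold_div channels are shifted
               in each temporal direction, the rest are left in place.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)

    # First group: each frame receives these channels from the previous frame.
    out[1:, :fold] = x[:-1, :fold]
    # Second group: each frame receives these channels from the next frame.
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
    # Remaining channels are copied unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


# Example: 4 frames, 16 channels, 7x7 feature maps (arbitrary sizes).
clip = np.random.rand(4, 16, 7, 7)
shifted = temporal_shift(clip)
print(shifted.shape)
```

Because the operation only copies slices between neighbouring frames, it introduces no learnable parameters, which matches the "zero parameter" property that both citing papers emphasize.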

“…This module can be inserted in existing 2D CNNs to achieve time modeling of zero computation and zero parameters. Although TSM was widely used in tasks such as video classification and action recognition [22, 23], no researcher applied TSM to the field of pig farming. In this study, we aimed to insert TSM into four widely used 2D CNN models, which enhance the model’s learning ability on time features, while maintaining the model’s performance in handling spatial features.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Aggressive Behavior Recognition of Pigs Based on Temporal Shift Module

Teng et al. 2023

Animals

Aggressive behavior among pigs is a significant social issue that has severe repercussions on both the profitability and welfare of pig farms. Due to the complexity of aggression, recognizing it requires the consideration of both spatial and temporal features. To address this problem, we proposed an efficient method that utilizes the temporal shift module (TSM) for automatic recognition of pig aggression. In general, TSM is inserted into four 2D convolutional neural network models, including ResNet50, ResNeXt50, DenseNet201, and ConvNext-t, enabling the models to process both spatial and temporal features without increasing the model parameters and computational complexity. The proposed method was evaluated on the dataset established in this study, and the results indicate that the ResNeXt50-T (TSM inserted into ResNeXt50) model achieved the best balance between recognition accuracy and model parameters. On the test set, the ResNeXt50-T model achieved accuracy, recall, precision, F1 score, speed, and model parameters of 95.69%, 95.25%, 96.07%, 95.65%, 29 ms, and 22.98 M, respectively. These results show that the proposed method can effectively improve the accuracy of recognizing pig aggressive behavior and provide a reference for behavior recognition in actual scenarios of smart livestock farming.
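
As a quick consistency check on the metrics reported in this abstract, the sketch below computes F1 as the harmonic mean of the reported precision and recall. The helper name `f1_from_precision_recall` is hypothetical, and the abstract does not say whether its F1 score was averaged per class, so treating it as a single harmonic mean is an assumption.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for ResNeXt50-T in the abstract, in percent.
reported_precision = 96.07
reported_recall = 95.25

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(f1, 2))
```

The result is about 95.66, within 0.01 of the reported 95.65%. The small gap could come from rounding of the inputs or from per-class averaging, neither of which the abstract specifies.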
