2022
DOI: 10.1371/journal.pone.0265115

STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video

Abstract: Most deep learning-based action recognition models focus only on short-term motion, so they often misjudge actions composed of multiple stages, such as the long jump and high jump. Temporal Segment Networks (TSN) enable the network to capture long-term information in a video, but overlook the fact that unrelated frames or regions in the video can also strongly interfere with action recognition. To solve this problem, a soft attention mechanism is introduced into TSN …
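To make the idea of soft attention over TSN segments concrete, below is a minimal, hypothetical PyTorch sketch: per-segment features from a shared backbone are scored by a small MLP, softmax-normalized, and used as weights in the segment consensus before classification. Module names, dimensions, and the consensus form are assumptions for illustration, not the authors' exact STA-TSN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSegmentAttention(nn.Module):
    """Weights per-segment features with soft attention before the consensus step.

    Illustrative sketch only; STA-TSN's actual attention design may differ.
    """

    def __init__(self, feat_dim: int, num_classes: int, hidden_dim: int = 256):
        super().__init__()
        # Small MLP that scores how relevant each segment is to the action.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # seg_feats: (batch, num_segments, feat_dim) from a shared 2D CNN backbone.
        scores = self.scorer(seg_feats).squeeze(-1)                   # (batch, num_segments)
        weights = F.softmax(scores, dim=1)                            # soft attention over segments
        video_feat = (weights.unsqueeze(-1) * seg_feats).sum(dim=1)   # attention-weighted consensus
        return self.classifier(video_feat)                            # (batch, num_classes)

# Example: 8 segments sampled from a long, multi-stage action such as a long jump.
model = SoftSegmentAttention(feat_dim=2048, num_classes=101)
logits = model(torch.randn(4, 8, 2048))
```

Compared with TSN's plain average consensus, the learned weights let the model down-weight segments whose frames are unrelated to the action.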

Cited by 19 publications (4 citation statements)
References 41 publications

“…Furthermore, our employed boosted CSAA model has demonstrated competitive performance in contrast to multi‐stream CNN and LSTM‐based models [51, 88, 89]. The method outperforms the attention modules proposed in residual CNN structure from only RGB frames or in combination with optical flows [93]. The introduced encoded motion information additionally outperforms various two‐stream‐based methods, which typically involve stacking optical flow data as a separate stream to enhance performance [53].…”
Section: Experiments and Discussion (mentioning)
confidence: 99%
“…For example, when TSN is plugged into GSM [3], an accuracy improvement of 32% is achieved. Furthermore, Yang et al [4] used TSN with a soft attention mechanism to capture important frames from each segment. Moreover, Zhang et al [5] have used the TSN model as a feature extractor with ResNet101 for efficient behavior recognition of pigs.…”
Section: Multimodal Recognition Methods (mentioning)
confidence: 99%
“…Interpretable spatio-temporal attention [48] used spatial and temporal attention via ConvLSTM. Recent selfattention mechanisms are also introduced in STA-TSN [49] and GTA [50], as well as Transformer-based video models [3]. Although some of these methods do not aim to visual explanation, the blurry map issue still remains for videos because the ability of temporal modeling, which is useful for classification, may be harmful to capture sharp spatial attention maps.…”
Section: Related Work (mentioning)
confidence: 99%
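For reference, the spatial attention maps discussed in the quoted passage can be sketched as a single-channel saliency map, softmax-normalized over spatial positions, that reweights the backbone feature map of each frame. The PyTorch snippet below is a generic illustration under assumed shapes, not the specific formulation of STA-TSN, GTA, or the ConvLSTM-based method cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionMap(nn.Module):
    """Produces a per-frame spatial attention map over CNN feature positions.

    Generic sketch for illustration; the cited methods use their own designs
    (e.g. ConvLSTM-based or self-attention-based attention).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # 1x1 conv -> saliency logits

    def forward(self, feat: torch.Tensor):
        # feat: (batch, channels, H, W) feature map of one frame.
        b, c, h, w = feat.shape
        logits = self.score(feat).view(b, -1)                 # (batch, H*W)
        attn = F.softmax(logits, dim=1).view(b, 1, h, w)      # attention map sums to 1 per frame
        attended = (feat * attn).sum(dim=(2, 3))              # (batch, channels) attended feature
        return attended, attn                                 # attn can be upsampled and shown as a heat map

attended, attn_map = SpatialAttentionMap(channels=2048)(torch.randn(2, 2048, 7, 7))
```

The sharpness of `attn_map` is exactly what the quoted passage questions: temporal modeling that helps classification can still yield blurry spatial maps when visualized.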