2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00189

Interpretable Spatio-Temporal Attention for Video Action Recognition

Abstract: arXiv:1810.04511v2 [cs.CV], 3 Jun 2019. [The abstract text is truncated in this record. The accompanying architecture figure shows per-frame CNN features passed through a spatial attention module and a temporal attention module into a convolutional LSTM, whose per-frame outputs are averaged to produce an action label such as "Playing Volleyball".]
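The pipeline recoverable from the figure (per-frame CNN features, spatial attention masks, temporal attention weights, pooling, classification) can be sketched as follows. This is a minimal numpy sketch, not the authors' implementation: the feature maps, attention logits, and classifier weights are random stand-ins, and a simple attention-weighted pooling plus linear head stands in for the ConvLSTM + AVG stages of the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, D, K = 5, 4, 4, 8, 10  # frames, feature-map size, channels, classes

def softmax(x, axis=None):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in per-frame CNN feature maps (produced by a CNN backbone in the paper).
feats = rng.standard_normal((T, H, W, D))

# Spatial attention: one importance map per frame, applied element-wise.
spatial_logits = rng.standard_normal((T, H, W))
spatial = softmax(spatial_logits.reshape(T, -1), axis=1).reshape(T, H, W)
attended = feats * spatial[..., None]

# Temporal attention: one scalar weight per frame, softmax-normalized over time.
temporal = softmax(rng.standard_normal(T))
pooled = (temporal[:, None, None, None] * attended).sum(axis=0)  # (H, W, D)

# Stand-in classifier head replacing the ConvLSTM + AVG of the figure.
logits = pooled.mean(axis=(0, 1)) @ rng.standard_normal((D, K))
probs = softmax(logits)
```

Because both attention distributions are explicit softmax weights, they can be inspected directly, which is the interpretability angle the paper's title refers to.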

Cited by 82 publications (42 citation statements)
References 47 publications
“…A soft attention mechanism [1] was inserted before the feature-extraction model and is responsible for blacking out irrelevant parts of the input to minimize their impact on prediction. The spatial attention block consists of a convolutional network with 'same' padding in all layers that learns to produce an importance mask for each input image, which is then multiplied element-wise with the original input image.…”
Section: Spatial Attention
confidence: 99%
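The masking step this quote describes can be sketched in a few lines. A hedged numpy sketch, assuming the mask logits come from some 'same'-padded conv net (here they are supplied directly), with softmax normalization and a peak-rescaling step chosen for illustration:

```python
import numpy as np

def spatial_attention(frame, mask_logits):
    """Suppress irrelevant regions of a frame with an importance mask.

    frame:       (H, W, C) input image.
    mask_logits: (H, W) raw scores; in the cited setup these come from a
                 conv net with 'same' padding, here they are given.
    """
    e = np.exp(mask_logits - mask_logits.max())
    mask = e / e.sum()               # softmax over all spatial positions
    mask = mask / mask.max()         # rescale so the peak pixel keeps full intensity
    return frame * mask[..., None]   # element-wise multiply, broadcast over channels

# Toy usage: pretend the conv net strongly highlighted one location.
frame = np.ones((8, 8, 3))
logits = np.zeros((8, 8))
logits[2, 3] = 6.0
out = spatial_attention(frame, logits)
```

The highlighted pixel passes through unchanged while every other location is scaled down by roughly exp(-6), i.e. effectively blacked out.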
“…In [27], the authors present a hierarchical attention network for document classification. Recently, attention networks have also been proposed for action recognition in videos [28], [29], [30]. In [28], the authors introduce an attention mechanism as low-rank second-order pooling for single-image classification.…”
Section: B. Attention Network
confidence: 99%
“…In [29], the authors developed an attention-based neural network to model scene-object interactions for action recognition and video captioning. In [30], the authors introduce two separate temporal and spatial attention mechanisms to identify the most relevant frames of a particular action and the most relevant spatial locations within those frames.…”
Section: B. Attention Network
confidence: 99%
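The temporal half of the mechanism described above (selecting the most relevant frames) reduces to attention-weighted pooling over time. A minimal sketch, assuming the per-frame relevance scores are learned in the cited work but supplied directly here:

```python
import numpy as np

def temporal_attention(frame_feats, scores):
    """Pool per-frame features with softmax temporal attention weights.

    frame_feats: (T, D) one feature vector per frame.
    scores:      (T,) raw relevance scores, one per frame.
    Returns (pooled, weights); the weights are directly interpretable
    as which frames the model considered most relevant to the action.
    """
    w = np.exp(scores - scores.max())
    w = w / w.sum()                           # softmax over time
    pooled = (w[:, None] * frame_feats).sum(axis=0)
    return pooled, w

# Toy usage: frame t carries the constant vector t; frame 2 dominates the scores.
feats = np.stack([np.full(4, t, dtype=float) for t in range(5)])
pooled, w = temporal_attention(feats, np.array([0.0, 0.0, 10.0, 0.0, 0.0]))
```

With one score far above the rest, the pooled vector is almost exactly the dominant frame's features, which is the "most relevant frame" behavior the quote describes.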
“…Temporal attention performs the same selective role as spatial attention but in the time domain, choosing certain timesteps over others. Although there is a smaller body of research on temporal attention in the machine learning community [25,17,11], it is well established in the neuroscience community that the combination of spatial and temporal attention in human perception is critical for perceiving occluded and moving objects in time [3][10], two of the most prominent features of a flickering Atari game screen. The authors combine spatial and temporal attention in a deep RL agent in an attempt to recreate the additive effects [10] seen in human perception.…”
Section: Spatio-temporal Model for POMDPs
confidence: 99%