2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00189

Interpretable Spatio-Temporal Attention for Video Action Recognition

Abstract: arXiv:1810.04511v2 [cs.CV], 3 Jun 2019. [The abstract text is truncated in this record. The accompanying architecture figure shows per-frame CNN features passed through a spatial attention module and a temporal attention module into a convolutional LSTM, whose per-frame outputs are averaged to produce an action label such as "Playing Volleyball".]
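The pipeline recoverable from the figure (per-frame CNN features, spatial attention masks, temporal attention weights, pooling, classification) can be sketched as follows. This is a minimal numpy sketch, not the authors' implementation: the feature maps, attention logits, and classifier weights are random stand-ins, and a simple attention-weighted pooling plus linear head stands in for the ConvLSTM + AVG stages of the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, D, K = 5, 4, 4, 8, 10  # frames, feature-map size, channels, classes

def softmax(x, axis=None):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in per-frame CNN feature maps (produced by a CNN backbone in the paper).
feats = rng.standard_normal((T, H, W, D))

# Spatial attention: one importance map per frame, applied element-wise.
spatial_logits = rng.standard_normal((T, H, W))
spatial = softmax(spatial_logits.reshape(T, -1), axis=1).reshape(T, H, W)
attended = feats * spatial[..., None]

# Temporal attention: one scalar weight per frame, softmax-normalized over time.
temporal = softmax(rng.standard_normal(T))
pooled = (temporal[:, None, None, None] * attended).sum(axis=0)  # (H, W, D)

# Stand-in classifier head replacing the ConvLSTM + AVG of the figure.
logits = pooled.mean(axis=(0, 1)) @ rng.standard_normal((D, K))
probs = softmax(logits)
```

Because both attention distributions are explicit softmax weights, they can be inspected directly, which is the interpretability angle the paper's title refers to.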

Cited by 82 publications (42 citation statements)
References 47 publications
“…A soft attention mechanism [1] was inserted before the feature-extraction model and is responsible for blacking out irrelevant parts of the input to minimize their impact on prediction. The spatial attention block consists of a convolutional network with 'same' padding in all layers that learns to produce an importance mask for each input image, which is then multiplied element-wise with the original input image.…”
Section: Spatial Attention
confidence: 99%
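The masking step this quote describes can be sketched in a few lines. A hedged numpy sketch, assuming the mask logits come from some 'same'-padded conv net (here they are supplied directly), with softmax normalization and a peak-rescaling step chosen for illustration:

```python
import numpy as np

def spatial_attention(frame, mask_logits):
    """Suppress irrelevant regions of a frame with an importance mask.

    frame:       (H, W, C) input image.
    mask_logits: (H, W) raw scores; in the cited setup these come from a
                 conv net with 'same' padding, here they are given.
    """
    e = np.exp(mask_logits - mask_logits.max())
    mask = e / e.sum()               # softmax over all spatial positions
    mask = mask / mask.max()         # rescale so the peak pixel keeps full intensity
    return frame * mask[..., None]   # element-wise multiply, broadcast over channels

# Toy usage: pretend the conv net strongly highlighted one location.
frame = np.ones((8, 8, 3))
logits = np.zeros((8, 8))
logits[2, 3] = 6.0
out = spatial_attention(frame, logits)
```

The highlighted pixel passes through unchanged while every other location is scaled down by roughly exp(-6), i.e. effectively blacked out.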
“…In [27], the authors present a hierarchical attention network for document classification. Recently, attention networks have also been proposed for action recognition in videos [28], [29], [30]. In [28], the authors introduce an attention mechanism as low-rank second-order pooling for single-image classification.…”
Section: B. Attention Network
confidence: 99%
“…In [29], the authors developed an attention-based neural network to model scene-object interactions for action recognition and video captioning. In [30], the authors introduce two separate temporal and spatial attention mechanisms to identify the most relevant frames of a particular action and the most relevant spatial locations within those frames.…”
Section: B. Attention Network
confidence: 99%
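The temporal half of the mechanism described above (selecting the most relevant frames) reduces to attention-weighted pooling over time. A minimal sketch, assuming the per-frame relevance scores are learned in the cited work but supplied directly here:

```python
import numpy as np

def temporal_attention(frame_feats, scores):
    """Pool per-frame features with softmax temporal attention weights.

    frame_feats: (T, D) one feature vector per frame.
    scores:      (T,) raw relevance scores, one per frame.
    Returns (pooled, weights); the weights are directly interpretable
    as which frames the model considered most relevant to the action.
    """
    w = np.exp(scores - scores.max())
    w = w / w.sum()                           # softmax over time
    pooled = (w[:, None] * frame_feats).sum(axis=0)
    return pooled, w

# Toy usage: frame t carries the constant vector t; frame 2 dominates the scores.
feats = np.stack([np.full(4, t, dtype=float) for t in range(5)])
pooled, w = temporal_attention(feats, np.array([0.0, 0.0, 10.0, 0.0, 0.0]))
```

With one score far above the rest, the pooled vector is almost exactly the dominant frame's features, which is the "most relevant frame" behavior the quote describes.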
“…Temporal attention performs the same selective role as spatial attention but in the time domain, choosing certain timesteps over others. Although there is a smaller body of research on temporal attention in the machine learning community [25,17,11], it is well established in the neuroscience community that the combination of spatial and temporal attention in human perception is critical for perceiving occluded and moving objects in time [3][10], two of the most prominent features of a flickering Atari game screen. The authors combine spatial and temporal attention in a deep RL agent in an attempt to recreate the additive effects [10] seen in human perception.…”
Section: Spatio-temporal Model for POMDPs
confidence: 99%