2015
DOI: 10.48550/arxiv.1511.04119
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Action Recognition using Visual Attention

Abstract: We propose a soft attention based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. The model essentially learns which parts in the frames are relevant for the task at hand and attaches higher importance to them. We evaluate the model on UCF-11 (YouTube … Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

2
183
0

Year Published

2018
2018
2021
2021

Publication Types

Select...
5
5

Relationship

0
10

Authors

Journals

citations
Cited by 117 publications
(192 citation statements)
references
References 15 publications
2
183
0
Order By: Relevance
“…Self-Attention Mechanism. The self-attention [38] mechanism is widely used in the video understanding area since it can effectively capture long-term dependencies compared with other attention methods such as recurrent models [23] and pooling methods [12]. The Transformer [33] is also based on the self-attention mechanism, which is originally applied in the machine translation task.…”
Section: Related Workmentioning
confidence: 99%
“…Self-Attention Mechanism. The self-attention [38] mechanism is widely used in the video understanding area since it can effectively capture long-term dependencies compared with other attention methods such as recurrent models [23] and pooling methods [12]. The Transformer [33] is also based on the self-attention mechanism, which is originally applied in the machine translation task.…”
Section: Related Workmentioning
confidence: 99%
“…The BERT model [4] further combined this property with attention-based selection scheme for the selftraining of language model. There are also some trials in vision problems to incorporate contextual modeling with action recognition [9,17,21,34], but mostly focus on the spatial and short-term context, while accurate atomic action detection requires both shortterm and long-term cues in spatiotemporal domain.…”
Section: Long-term Context Reasoningmentioning
confidence: 99%
“…Video recognition methods that use an attention mechanism [2,5,24,29,38,40,52,53,56,58,61,62,65] have also been proposed [6,10,18,46,56,59,70]. Non-local neural networks [56], which are commonly used for introducing an attention mechanism, improve the accuracy of video recognition by capturing long-distance temporal dependency with a non-local operation capable of providing global information.…”
Section: Video Recognitionmentioning
confidence: 99%