Action Recognition using Visual Attention

Sharma, Shikhar; Kiros, Ryan; Salakhutdinov, Ruslan

doi:10.48550/arxiv.1511.04119

Cited by 117 publications

(192 citation statements)

References 15 publications

Supporting

Mentioning

183

Contrasting


“…Self-Attention Mechanism. The self-attention [38] mechanism is widely used in the video understanding area since it can effectively capture long-term dependencies compared with other attention methods such as recurrent models [23] and pooling methods [12]. The Transformer [33] is also based on the self-attention mechanism, which is originally applied in the machine translation task.…”

Section: Related Work (mentioning)

confidence: 99%

Temporal Context Aggregation Network for Temporal Action Proposal Refinement

Qing, Su, Gan et al. 2021

Preprint


Temporal action proposal generation aims to estimate temporal intervals of actions in untrimmed videos, which is a challenging yet important task in the video understanding field. The proposals generated by current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval owing to the lack of efficient temporal modeling and effective boundary context utilization. In this paper, we propose Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement. Specifically, we first design a Local-Global Temporal Encoder (LGTE), which adopts the channel grouping strategy to efficiently encode both "local and global" temporal inter-dependencies. Furthermore, both the boundary and internal context of proposals are adopted for frame-level and segment-level boundary regressions, respectively. Temporal Boundary Regressor (TBR) is designed to combine these two regression granularities in an end-to-end fashion, which achieves the precise boundaries and reliable confidence of proposals through progressive refinement. Extensive experiments are conducted on three challenging datasets: HACS, ActivityNet-v1.3, and THUMOS-


“…The BERT model [4] further combined this property with attention-based selection scheme for the self-training of language model. There are also some trials in vision problems to incorporate contextual modeling with action recognition [9,17,21,34], but mostly focus on the spatial and short-term context, while accurate atomic action detection requires both short-term and long-term cues in spatiotemporal domain.…”

Section: Long-term Context Reasoning (mentioning)

confidence: 99%

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

Zhang et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


In this paper, we place the atomic action detection problem into a Long-Short Term Context (LSTC) to analyze how the temporal reliance among video signals affects the action detection results. To do this, we decompose the action recognition pipeline into short-term and long-term reliance, in terms of the hypothesis that the two kinds of context are conditionally independent given the objective action instance. Within our design, a local aggregation branch is utilized to gather dense and informative short-term cues, while a high order long-term inference branch is designed to reason the objective action class from high-order interaction between actor and other person or person pairs. Both branches independently predict the context-specific actions and the results are merged in the end. We demonstrate that both temporal grains are beneficial to atomic action recognition. On the mainstream benchmarks of atomic action detection, our design can bring significant performance gain from the existing state-of-the-art pipeline.

CCS Concepts: Computing methodologies → Activity recognition and understanding.


“…Video recognition methods that use an attention mechanism [2,5,24,29,38,40,52,53,56,58,61,62,65] have also been proposed [6,10,18,46,56,59,70]. Non-local neural networks [56], which are commonly used for introducing an attention mechanism, improve the accuracy of video recognition by capturing long-distance temporal dependency with a non-local operation capable of providing global information.…”

Section: Video Recognition (mentioning)

confidence: 99%

ST-ABN: Visual Explanation Taking into Account Spatio-temporal Information for Video Recognition

Mitsuhara, Hirakawa, Yamashita et al. 2021

Preprint


It is difficult for people to interpret the decision-making in the inference process of deep neural networks. Visual explanation is one method for interpreting the decision-making of deep learning. It analyzes the decision-making of 2D CNNs by visualizing an attention map that highlights discriminative regions. Visual explanation for interpreting the decision-making process in video recognition is more difficult because it is necessary to consider not only spatial but also temporal information, which is different from the case of still images. In this paper, we propose a visual explanation method called spatio-temporal attention branch network (ST-ABN) for video recognition. It enables visual explanation for both spatial and temporal information. ST-ABN acquires the importance of spatial and temporal information during network inference and applies it to recognition processing to improve recognition performance and visual explainability. Experimental results with Something-Something datasets V1 & V2 demonstrated that ST-ABN enables visual explanation that takes into account spatial and temporal information simultaneously and improves recognition performance.

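The citation statement for this entry describes a non-local operation that captures long-distance temporal dependency by giving every frame access to global information. The following is only a rough sketch of that idea, not the code of any cited paper: it assumes per-frame feature vectors are already available, and it replaces the learned projections of a real non-local block with the identity. The function names `softmax` and `non_local_temporal` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_temporal(features):
    """Simplified non-local operation over T per-frame feature vectors.

    Assumption: the learned embeddings of a real non-local block are
    replaced by the identity, so pairwise similarity is a plain dot product.
    Returns (outputs, weights), where weights[i][j] is how strongly
    frame i attends to frame j.
    """
    dim = len(features[0])
    outputs, all_weights = [], []
    for x_i in features:
        # Compare frame i with every frame j, however far apart in time.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        weights = softmax(scores)
        # Aggregate global context as a weighted sum over all frames.
        context = [
            sum(w * x_j[d] for w, x_j in zip(weights, features))
            for d in range(dim)
        ]
        # Residual connection: add the context back onto the input feature.
        outputs.append([x + c for x, c in zip(x_i, context)])
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    # Toy input: frames 0 and 2 share a feature, frame 1 differs.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    _, attn = non_local_temporal(frames)
    print(attn[0])
```

In the toy input, frame 0 weights frame 2 exactly as much as itself even though they are not adjacent, which is the long-distance behavior the statement refers to.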
