Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition

Wu, Wenhao; He, Dongliang; Tan, Xiao; Chen, Shifeng; Wen, Shilei

doi:10.1109/iccv.2019.00632

Cited by 137 publications

(130 citation statements)

References 54 publications

Citation statement types: Supporting, Mentioning (126), Contrasting

“…NetVlad [1], ActionVlad [8], AttentionClusters [19] are proposed for better local feature integration instead of directly average pooling as used. MARL [31] uses multiple agents as frame selectors instead of the general uniform sampler from the entire video for better global temporal modeling. Each agent learns a flexibly moving policy through the temporal axis to get a vital representation frame and other agents' behavior as well.…”

Section: Long-term Modeling For Video Recognition (mentioning)

confidence: 99%

Deep Concept-wise Temporal Convolutional Networks for Action Localization

Lin, Liu, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

Self Cite

Existing action localization approaches adopt shallow temporal convolutional networks (i.e., TCN) on the 1D feature map extracted from video frames. In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. To address this issue, we introduce a novel concept-wise temporal convolution (CTC) layer as an alternative to the conventional temporal convolution layer for training deeper action localization networks. Instead of recombining latent concepts, the CTC layer applies a number of temporal filters to each concept separately, with filter parameters shared across concepts. It can thus capture common temporal patterns of different concepts and significantly enrich representation ability. By stacking CTC layers, we propose a deep concept-wise temporal convolutional network (C-TCN), which boosts the state-of-the-art action localization performance on THUMOS'14 from 42.8 to 52.1 in terms of mAP (%), a relative improvement of 21.7%. Favorable results are also obtained on ActivityNet.
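
The concept-wise idea in this abstract (the same small bank of temporal filters applied to every channel separately, with no mixing across channels) can be illustrated with a minimal pure-Python sketch. This is not the C-TCN authors' implementation: the function names, the zero padding, the odd-length kernels, and the C x K x T output layout are all assumptions made only for illustration.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1D cross-correlation that keeps the temporal length.

    Assumption for this sketch: the kernel has odd length.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch only supports odd-length kernels")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(signal))
    ]


def concept_wise_temporal_conv(
    feature_map: Sequence[Sequence[float]],
    filters: Sequence[Sequence[float]],
) -> List[List[List[float]]]:
    """Apply the same K temporal filters to each of the C channels separately.

    feature_map has shape C x T (one row per latent concept); the result has
    shape C x K x T. Channels are never summed together, which is the contrast
    with a conventional temporal convolution that recombines all channels.
    """
    return [
        [conv1d_same(channel, kernel) for kernel in filters]
        for channel in feature_map
    ]


if __name__ == "__main__":
    # Two hypothetical concepts over four time steps.
    feature_map = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
    # Two shared filters: an identity kernel and a temporal difference kernel.
    filters = [[0.0, 1.0, 0.0], [1.0, 0.0, -1.0]]
    for concept in concept_wise_temporal_conv(feature_map, filters):
        print(concept)
```

Because the filter bank is shared, the parameter count in this sketch depends only on the number and width of the filters, not on the number of channels.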

“…However, there are few for activity recognition especially for skeleton-based data. In [22], multiagent reinforcement learning is used to select key frames in videos where each agent is responsible for selecting a frame. As a result, the number of selected frames is fixed.…”

Section: B Reinforcement Learning In Activity Recognition (mentioning)

confidence: 99%

Joint Selection using Deep Reinforcement Learning for Skeleton-based Activity Recognition

Nikpour, Armanfard 2021

Preprint

Skeleton-based human activity recognition has attracted a lot of attention due to its wide range of applications. Skeleton data includes two- or three-dimensional coordinates of body joints. Not all body joints are effective in recognizing different activities, so finding key joints within a video and across different activities plays a significant role in improving performance. In this paper we propose a novel framework that performs joint selection in skeleton video frames for the purpose of human activity recognition. To this end, we formulate the joint selection problem as a Markov Decision Process (MDP) where we employ deep reinforcement learning to find the most informative joints per frame. The proposed joint selection method is a general framework that can be employed to improve human activity classification methods. Experimental results on two benchmark activity recognition data sets using three different classifiers demonstrate the effectiveness of the proposed joint selection method.

“…For action recognition, Dong et al [62] proposed an attention-aware sampling agent based on deep reinforcement learning to select the most discriminative frame in the inference step to improve performance. Wu et al [63] proposed a frame sampling agent based on multiagent reinforcement learning to drop non-informative frames of untrimmed video. Zheng et al [64] used reinforcement learning agents to select effective segments for inference.…”

Section: Related Work (mentioning)

confidence: 99%

ASNet: Auto-Augmented Siamese Neural Network for Action Recognition

Zhang, Xiong, et al. 2021

Sensors

Human action recognition methods in videos based on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional data augmentation approach may generate many non-informative samples (video patches covering only a small part of the foreground or only the background) that are not related to a specific action. These samples can be regarded as noisy samples with incorrect labels, which reduces the overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, we propose backpropagating salient patches and randomly cropped samples in the same iteration to perform gradient compensation to alleviate the adverse gradient effects of non-informative samples. Salient patches refer to the samples containing critical information for human action recognition. The generation of salient patches is formulated as a Markov decision process, and a reinforcement learning agent called SPA (Salient Patch Agent) is introduced to extract patches in a weakly supervised manner without extra labels. Extensive experiments were conducted on two well-known datasets UCF-101 and HMDB-51 to verify the effectiveness of the proposed SPA and ASNet.
