Adaptive RNN Tree for Large-Scale Human Action Recognition

Li, Wenbo; Wen, Longyin; Chang, Ming-Ching; Lim, Ser-Nam; Lyu, Siwei

doi:10.1109/iccv.2017.161

Cited by 102 publications

(59 citation statements)

References 24 publications

“…For skeleton-based action recognition, the study of viewpoint influences is under-explored. To be view-invariant, the commonly used strategies are to perform pre-processing of skeletons [6], [17], [25], [27], [29], [38], [42], [45], [55], [62]. Unfortunately, frame-level pre-processing, where each frame is transformed to the body center with the upperbody orientation aligned, usually results in the partial loss of relative motion information.…”

Section: Viewpoints In Skeleton-based Action Recognition (mentioning)

confidence: 99%

“…Similarly, Liu et al [29] use both global contextual information and local information to selectively focus on informative joints. Li et al [25] combine tree-like hierarchy RNNs with action category hierarchy to distinguish easy-tell actions in the low levels of networks and hard-tell actions in the high levels of networks. Different from the above works, we enhance the recognition performance from a new perspective.…”

Section: RNN For Skeleton-based Action Recognition (mentioning)

confidence: 99%

“…Some approaches use human defined criteria to pre-process the skeletons to reduce the challenges caused by view variations [6], [25], [27], [29], [38], [42], [62]. We make comparison on the effectiveness of those strategies and our view adaptation model.…”

Section: Comparison With Other Pre-processing Strategies (mentioning)

confidence: 99%

View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition

Zhang, Lan, Xing, et al. 2019

IEEE Trans. Pattern Anal. Mach. Intell.

413

234

Skeleton-based human action recognition has recently attracted increasing attention thanks to the accessibility and the popularity of 3D skeleton data. One of the key challenges in action recognition lies in the large variations of action representations when they are captured from different viewpoints. In order to alleviate the effects of view variations, this paper introduces a novel view adaptation scheme, which automatically determines the virtual observation viewpoints over the course of an action in a learning based data driven manner. Instead of re-positioning the skeletons using a fixed human-defined prior criterion, we design two view adaptive neural networks, i.e., VA-RNN and VA-CNN, which are respectively built based on the recurrent neural network (RNN) with the Long Short-term Memory (LSTM) and the convolutional neural network (CNN). For each network, a novel view adaptation module learns and determines the most suitable observation viewpoints, and transforms the skeletons to those viewpoints for the end-to-end recognition with a main classification network. Ablation studies find that the proposed view adaptive models are capable of transforming the skeletons of various views to much more consistent virtual viewpoints. Therefore, the models largely eliminate the influence of the viewpoints, enabling the networks to focus on the learning of action-specific features and thus resulting in superior performance. In addition, we design a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the final prediction, obtaining enhanced performance. Moreover, random rotation of skeleton sequences is employed to improve the robustness of view adaptation models and alleviate overfitting during training. Extensive experimental evaluations on five challenging benchmarks demonstrate the effectiveness of the proposed view-adaptive networks and superior performance over state-of-the-art approaches. 
The source code is available at https://github.com/microsoft/View-Adaptive-Neural-Networks-for-Skeleton-based-Human-Action-Recognition.

Fig. 1: Skeleton representations of the same posture captured from different viewpoints (different camera position and angle) are very different.

…distractions, and variations of viewpoints [2], [10], [33], [57]. Biological observations from the early seminal work of Johansson [18] suggest that humans are capable of recognizing actions from the motion of just a few joints of the human body, even without appearance information. The prevalence of cost-effective depth cameras such as Microsoft Kinect [59], Intel RealSense [1], dual camera devices, and the advance of powerful techniques for human pose estimation…

“…The created architecture is combined with a 3-dimensional ConvNet by using a two-stream fusion of the RNN and ConvNet, with an SVM. The use of multiple recurrent networks has also been scaled to include tree structures (RNN-T) [76], to perform a hierarchical recognition process in which each RNN is responsible for learning an action instance based on an Action Category Hierarchy (ACH). This allows for the distinction between very dissimilar classes high in the hierarchy, while subtle differences between related classes such as a handshake and a fist bump are dealt with in the lower nodes.…”

Section: Recurrent Network (mentioning)

confidence: 99%

Analyzing human–human interactions: A survey

Stergiou, Poppe 2019

Computer Vision and Image Understanding

Many videos depict people, and it is their interactions that inform us of their activities, relation to one another and the cultural and social setting. With advances in human action recognition, researchers have begun to address the automated recognition of these human-human interactions from video. The main challenges stem from dealing with the considerable variation in recording settings, the appearance of the people depicted and the performance of their interaction. This survey provides a summary of these challenges and datasets, followed by an in-depth discussion of relevant vision-based recognition and detection methods. We focus on recent, promising work based on convolutional neural networks (CNNs). Finally, we outline directions to overcome the limitations of the current state-of-the-art.

Main challenges in the field

We identify challenges when dealing with the visual and structural aspects of interaction videos. Additionally, we outline practical challenges in the development of methods of automated human-human action recognition.

“…Liu et al propose a spatio-temporal LSTM structure to explore the contextual dependency of joints in spatio-temporal domains [36]. Li et al propose an RNN tree network with a hierarchical structure which classifies the action classes that are easier to distinguish at the lower layers and the action classes that are harder to distinguish at higher layers [37]. To address the large view variation of the captured data, Zhang et al propose a view adaptive subnetwork which automatically selects the best observation viewpoints within an end-to-end network for recognition [38].…”

Section: Action Recognition (mentioning)

confidence: 99%

Adding Attentiveness to the Neurons in Recurrent Neural Networks

Zhang, Xue, Lan, et al. 2018

Lecture Notes in Computer Science

Recurrent neural networks (RNNs) are capable of modeling the temporal dynamics of complex sequential information. However, the structures of existing RNN neurons mainly focus on controlling the contributions of current and historical information but do not explore the different importance levels of different elements in an input vector of a time slot. We propose adding a simple yet effective Element-wise-Attention Gate (EleAttG) to an RNN block (e.g., all RNN neurons in a network layer) that empowers the RNN neurons to have the attentiveness capability. For an RNN block, an EleAttG is added to adaptively modulate the input by assigning different levels of importance, i.e., attention, to each element/dimension of the input. We refer to an RNN block equipped with an EleAttG as an EleAtt-RNN block. Specifically, the modulation of the input is content adaptive and is performed at fine granularity, being element-wise rather than input-wise. The proposed EleAttG, as an additional fundamental unit, is general and can be applied to any RNN structures, e.g., standard RNN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU). We demonstrate the effectiveness of the proposed EleAtt-RNN by applying it to the action recognition tasks on both 3D human skeleton data and RGB videos. Experiments show that adding attentiveness through EleAttGs to RNN blocks significantly boosts the power of RNNs.
