2018
DOI: 10.1109/tpami.2017.2691321

Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos

Abstract: Single modality action recognition on RGB or depth sequences has been extensively explored recently. It is generally accepted that each of these two modalities has different strengths and limitations for the task of action recognition. Therefore, analysis of the RGB+D videos can help us to better study the complementary properties of these two types of modalities and achieve higher levels of performance. In this paper, we propose a new deep autoencoder based shared-specific feature factorization network to sep…

Cited by 210 publications (120 citation statements) | References 73 publications
“…As previously shown [30], deep Conv-Deconv structures are hard to train to convergence. To improve convergence, we adopted the “U-Net” design [27], adding extra connections between the associated convolutional and de-convolutional layers.…”
Section: Methods
mentioning confidence: 85%
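A minimal sketch, in PyTorch, of the skip-connection idea this statement describes: each encoder feature map is concatenated with the matching decoder input so gradients have a shorter path. The class name ConvDeconvWithSkips and all layer sizes are assumptions for illustration, not details from [27] or [30].

import torch
import torch.nn as nn

class ConvDeconvWithSkips(nn.Module):
    # Toy Conv-Deconv network with a U-Net-style skip connection; the
    # concatenation shortens the gradient path and, in practice, eases
    # the convergence problem the statement mentions.
    def __init__(self, in_ch=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        # 16 + 16 input channels: the decoder features plus the skipped enc1 features.
        self.dec1 = nn.ConvTranspose2d(16 + 16, in_ch, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                 # B x 16 x H/2 x W/2
        e2 = self.enc2(e1)                # B x 32 x H/4 x W/4
        d2 = self.dec2(e2)                # B x 16 x H/2 x W/2
        d2 = torch.cat([d2, e1], dim=1)   # skip connection
        return self.dec1(d2)              # B x in_ch x H x W

For a 64x64 input, ConvDeconvWithSkips()(torch.randn(1, 3, 64, 64)) returns a tensor of the same shape; without the torch.cat skip, all information would have to pass through the e2 bottleneck.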
“…Recurrent neural networks and long short-term memory (LSTM) networks have also been used to model long-range temporal associations. The ConvNet-LSTM structure was used for activity recognition with different types of input (RGB video, mobile sensor data) [16, 30, 35].…”
Section: Related Work
mentioning confidence: 99%
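A hedged PyTorch sketch of the ConvNet-LSTM pattern referenced above: a 2D CNN embeds each frame, an LSTM aggregates the per-frame embeddings over time, and a linear head classifies the sequence. The class name, layer sizes, and the 60-class output are illustrative assumptions, not details from [16, 30, 35].

import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    # Per-frame CNN embedding followed by temporal aggregation with an LSTM.
    def __init__(self, num_classes=60, feat_dim=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                   # clip: B x T x 3 x H x W
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))   # (B*T) x feat_dim
        feats = feats.view(b, t, -1)           # B x T x feat_dim
        _, (h_n, _) = self.lstm(feats)         # h_n: 1 x B x hidden
        return self.head(h_n[-1])              # B x num_classes

ConvLSTMClassifier()(torch.randn(2, 16, 3, 112, 112)) yields logits of shape (2, 60) for a batch of two 16-frame clips.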
“…An intuitive way to combine multimodal features is to directly concatenate them. To mine more useful information from the multimodal features and obtain better performance, researchers have proposed explicitly learning shared-specific structures among the features.…”
Section: Related Work
mentioning confidence: 99%
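The direct-concatenation baseline mentioned above amounts to a single tensor operation; a tiny PyTorch illustration, with made-up feature dimensions:

import torch

# Pre-extracted per-modality features (dimensions are made up for the example).
rgb_feat   = torch.randn(8, 512)   # B x D_rgb
depth_feat = torch.randn(8, 256)   # B x D_depth

# Direct concatenation: the baseline fusion the statement describes.
fused = torch.cat([rgb_feat, depth_feat], dim=1)   # B x (512 + 256)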
“…During encoding, the informative parts of the inputs can be merged into a more distinctive representation, which has been proven effective by many methods and applications [11, 62]. To speed up the fusion operation, we argue that “prefused” weights can be used directly as initializations for the AE network, owing to the consistency of goals, i.e., assigning labels to human actions, between the earlier steps and the fusion step. Specifically, we adopt a pretrained fully connected network together with a small data set D, which not only assigns the initial weights… In fact, the idea of adopting a fully connected network to set the initial parameters is similar in spirit to the fully connected layer of an LSTM, which transforms the initial weighting process into a single fully connected layer.…”
Section: Fusion of Heterogeneous Features by AE Network
mentioning confidence: 99%
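A speculative PyTorch sketch of the initialization scheme this statement appears to describe: pretrain a small fully connected classifier on the labelled subset D, then copy its first-layer weights into the autoencoder's encoder as the “prefused” starting point. All names and dimensions are hypothetical.

import torch
import torch.nn as nn

in_dim, code_dim, num_classes = 768, 128, 60   # hypothetical sizes

# Small fully connected classifier, assumed pretrained on the subset D.
pretrain_net = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU(),
                             nn.Linear(code_dim, num_classes))
# ... train pretrain_net on the small labelled subset D ...

autoencoder = nn.Sequential(
    nn.Linear(in_dim, code_dim), nn.ReLU(),   # encoder
    nn.Linear(code_dim, in_dim),              # decoder
)

# Reuse the pretrained first-layer weights as the encoder's initialization,
# rather than training the AE from scratch.
with torch.no_grad():
    autoencoder[0].weight.copy_(pretrain_net[0].weight)
    autoencoder[0].bias.copy_(pretrain_net[0].bias)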
“…Shahroudy et al. [7] proposed a shared-specific feature factorization network to separate the input multimodal signals into a hierarchy of components. This network achieved much higher accuracy for action recognition in RGB+D videos, but the results are still limited by the poor performance of the RGB-based features on the cross-view task.…”
Section: Introduction
mentioning confidence: 99%
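To make the shared-specific idea concrete, here is an illustrative PyTorch sketch, not the architecture of [7]: each modality is projected into a modality-specific component and a shared component, and an alignment penalty encourages the shared components of the two modalities to agree.

import torch
import torch.nn as nn

class SharedSpecificFactorization(nn.Module):
    # Each modality gets a shared and a specific linear projection; the
    # alignment term pulls the shared parts of the two modalities together.
    def __init__(self, rgb_dim=512, depth_dim=256, k=128):
        super().__init__()
        self.rgb_shared     = nn.Linear(rgb_dim, k)
        self.rgb_specific   = nn.Linear(rgb_dim, k)
        self.depth_shared   = nn.Linear(depth_dim, k)
        self.depth_specific = nn.Linear(depth_dim, k)

    def forward(self, rgb, depth):
        s_r, p_r = self.rgb_shared(rgb), self.rgb_specific(rgb)
        s_d, p_d = self.depth_shared(depth), self.depth_specific(depth)
        align = (s_r - s_d).pow(2).mean()           # shared components should agree
        fused = torch.cat([s_r + s_d, p_r, p_d], dim=1)
        return fused, align

In training, the align term would be added to the task loss, while the fused vector (3k-dimensional here) feeds the downstream action classifier.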