A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences

Asadi-Aghbolaghi, Maryam; Clapés, Albert; Bellantonio, Marco; Escalante, Hugo Jair; Ponce-López, Víctor; Baró, Xavier; Guyon, Isabelle; Kasaei, Shohreh; Escalera, Sérgio

doi:10.1109/fg.2017.150

Cited by 165 publications

(92 citation statements)

References 97 publications

(128 reference statements)

“…Our method is inspired from [29]. Layer 2 extracts the local phase spectra of f (x) by computing the 3D Short Term Fourier Transform (STFT) in a local n×n×n neighborhood N x at each position x of f (x) using Equation 1.…”

Section: Methods (mentioning)

confidence: 99%

LP-3DCNN: Unveiling Local Phase in 3D Convolutional Neural Networks

Kumawat

Raman

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Traditional 3D Convolutional Neural Networks (CNNs) are computationally expensive, memory intensive, prone to overfit, and most importantly, there is a need to improve their feature learning capabilities. To address these issues, we propose Rectified Local Phase Volume (ReLPV) block, an efficient alternative to the standard 3D convolutional layer. The ReLPV block extracts the phase in a 3D local neighborhood (e.g., 3 × 3 × 3) of each position of the input map to obtain the feature maps. The phase is extracted by computing 3D Short Term Fourier Transform (STFT) at multiple fixed low frequency points in the 3D local neighborhood of each position. These feature maps at different frequency points are then linearly combined after passing them through an activation function. The ReLPV block provides significant parameter savings of at least 3^3 to 13^3 times compared to the standard 3D convolutional layer with the filter sizes 3 × 3 × 3 to 13 × 13 × 13, respectively. We show that the feature learning capabilities of the ReLPV block are significantly better than the standard 3D convolutional layer. Furthermore, it produces consistently better results across different 3D data representations. We achieve state-of-the-art accuracy on the volumetric ModelNet10 and ModelNet40 datasets while utilizing only 11% parameters of the current state-of-the-art. We also improve the state-of-the-art on the UCF-101 split-1 action recognition dataset by 5.68% (when trained from scratch) while using only 15% of the parameters of the state-of-the-art. The project webpage is available at
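The pipeline described in this abstract (a local 3D STFT at fixed low-frequency points, an activation, then a linear combination) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names `local_stft_3d` and `relpv_like_block`, the zero padding, the choice of ReLU as the activation, and the example frequency points are all assumptions made for illustration.

```python
import numpy as np


def local_stft_3d(volume, freqs, n=3):
    """Real and imaginary STFT responses in an n*n*n neighborhood of every voxel.

    volume: array of shape (D, H, W).
    freqs:  list of 3-tuples, each a fixed frequency point (assumed values).
    Returns an array of shape (2 * len(freqs), D, H, W).
    """
    r = n // 2
    padded = np.pad(volume, r, mode="constant")  # zero padding (assumption)
    # windows[x, y, z] is the n*n*n neighborhood centered on voxel (x, y, z)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (n, n, n))
    offsets = np.arange(-r, r + 1)
    dz, dy, dx = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    maps = []
    for v in freqs:
        # Fixed (non-trainable) complex exponential kernel for this frequency
        kernel = np.exp(-2j * np.pi * (v[0] * dz + v[1] * dy + v[2] * dx))
        response = np.einsum("xyzijk,ijk->xyz", windows, kernel)
        maps.extend([response.real, response.imag])
    return np.stack(maps)


def relpv_like_block(volume, freqs, weights):
    """Activation followed by a per-voxel linear combination of the STFT maps.

    weights: array of shape (out_channels, 2 * len(freqs)); in this sketch
    these are the only trainable parameters.
    """
    features = np.maximum(local_stft_3d(volume, freqs), 0.0)  # ReLU (assumption)
    return np.tensordot(weights, features, axes=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vol = rng.standard_normal((8, 8, 8))
    freqs = [(1 / 3, 0, 0), (0, 1 / 3, 0), (0, 0, 1 / 3)]  # illustrative only
    w = rng.standard_normal((4, 2 * len(freqs)))
    print(relpv_like_block(vol, freqs, w).shape)
```

In this sketch the STFT kernels are fixed, so the trainable weights do not grow with the neighborhood size n, which is consistent with the kind of parameter saving the abstract reports.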

“…CNNs have been used to classify actions and interactions in single frames [4,9,42]. Similar to the use of handcrafted features (Section 3), the focus is on a characteristic joint pose.…”

Section: Single Frame Network (mentioning)

confidence: 99%

Analyzing human–human interactions: A survey

Stergiou

Poppe

2019

Computer Vision and Image Understanding

Many videos depict people, and it is their interactions that inform us of their activities, relation to one another and the cultural and social setting. With advances in human action recognition, researchers have begun to address the automated recognition of these human-human interactions from video. The main challenges stem from dealing with the considerable variation in recording settings, the appearance of the people depicted and the performance of their interaction. This survey provides a summary of these challenges and datasets, followed by an in-depth discussion of relevant vision-based recognition and detection methods. We focus on recent, promising work based on convolutional neural networks (CNNs). Finally, we outline directions to overcome the limitations of the current state-of-the-art. Main challenges in the field: We identify challenges when dealing with the visual and structural aspects of interaction videos. Additionally, we outline practical challenges in the development of methods of automated human-human action recognition.

“…During the back-propagation, due to its property of differentiability, it updates the gradient. The corresponding mask gradient of the input feature in the soft mask layer is as shown in equation (2). If the trunk features T are not correct, mask can prevent [54] T features to update the parameters as there is a multiplication factor of the mask M with partial derivative of T as shown in equation (2).…”

Section: 3D Residual Attention Network (mentioning)

confidence: 99%

Res3ATN - Deep 3D Residual Attention Network for Hand Gesture Recognition in Videos

Dhingra

Kunz

2019

2019 International Conference on 3D Vision (3DV)

Hand gesture recognition is a strenuous task to solve in videos. In this paper, we use a 3D residual attention network which is trained end to end for hand gesture recognition. Based on the stacked multiple attention blocks, we build a 3D network which generates different features at each attention block. Our 3D attention based residual network (Res3ATN) can be built and extended to very deep layers. Using this network, an extensive analysis is performed on other 3D networks based on three publicly available datasets. The Res3ATN network performance is compared to C3D, ResNet-10, and ResNext-101 networks. We study and evaluate our baseline network with different number and position of attention blocks. The comparison shows that the 3D residual attention network with 3 attention blocks is robust in attention learning and can classify the gestures with better accuracy, thus outperforming existing networks.
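The citation statement quoted for this entry says the soft mask M multiplies the partial derivative of the trunk features T. Equation (2) of that paper is not reproduced on this page, so the following is only a toy sketch under the assumption that the masked output is the elementwise product M · T. The one-parameter trunk `w * x` and the names `masked_output` and `grad_wrt_trunk_param` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def masked_output(w, x, mask_logit):
    """Toy attention output M * T with a one-parameter trunk (assumption)."""
    trunk = w * x               # hypothetical trunk branch T(x; w)
    mask = sigmoid(mask_logit)  # soft mask M in (0, 1)
    return mask * trunk


def grad_wrt_trunk_param(x, mask_logit):
    """Analytic gradient: d(M * T)/dw = M * dT/dw = M * x."""
    return sigmoid(mask_logit) * x


if __name__ == "__main__":
    x, w, eps = 2.0, 0.5, 1e-6
    for logit in (4.0, -4.0):
        numeric = (masked_output(w + eps, x, logit)
                   - masked_output(w - eps, x, logit)) / (2 * eps)
        print(logit, grad_wrt_trunk_param(x, logit), numeric)
```

When the mask is close to zero, the gradient that reaches the trunk parameter is scaled towards zero as well, which is the gating effect the quoted statement describes.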
