Learning to recognise 3D human action from a new skeleton‐based representation using deep convolutional neural networks

Pham, Huy-Hieu; Khoudour, Louahdi; Crouzil, Alain; Zegers, Pablo; Velastín, Sergio A.

doi:10.1049/iet-cvi.2018.5014

Cited by 32 publications

(17 citation statements)

References 83 publications


“…These results demonstrate the effectiveness of the proposed representation and deep learning framework since they surpass previous state-of-the-art techniques such as Lie Group Representation [23], Hierarchical RNN [37], Dynamic Skeletons [93], Two-Layer P-LSTM [39], ST-LSTM Trust Gates [40], Geometric Features [74], Two-Stream RNN [91], Enhanced Skeleton [94], Lie Group Skeleton+CNN [95], and GCA-LSTM [92]. The experimental results have also shown that the proposed method leads to better overall action recognition performance than our previous models including Skeleton-based ResNet [51] and SPMF Inception-ResNet-222 [48]. With a high recognition rate on the Cross-View evaluation (86.82%) where the sequences provided by cameras 2 and 3 are used for training and sequences from camera 1 are used for test, the proposed method shows its effectiveness for dealing with view-independent action recognition problem.…”

Section: Experimental Results and Analysis · mentioning

confidence: 99%
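The Cross-View protocol described in the statement above (cameras 2 and 3 for training, camera 1 for testing) can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the `camera` field and the sample dictionaries are hypothetical names chosen for the example.

```python
def cross_view_split(samples):
    """Split samples by camera id: cameras 2 and 3 train, camera 1 tests.

    `samples` is a list of dicts with a hypothetical integer `camera` key.
    """
    train = [s for s in samples if s["camera"] in (2, 3)]
    test = [s for s in samples if s["camera"] == 1]
    return train, test


# Toy data: one sequence id per camera (made up for illustration).
samples = [
    {"id": "a", "camera": 1},
    {"id": "b", "camera": 2},
    {"id": "c", "camera": 3},
]
train, test = cross_view_split(samples)
```

Because no camera appears in both subsets, the test accuracy reflects how well a model generalises to an unseen viewpoint.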

“…This makes the training and inference processes much simpler and faster. Third, as shown in our previous works [48,51], the spatio–temporal dynamics of skeleton sequences can be transformed into color images—a kind of 3D tensor-structured representation that can be effectively learned by representation learning models as D-CNNs. Fourth, many different action classes share a great number of similar primitives, which interferes with action classification.…”

Section: Introduction · mentioning

confidence: 99%


Spatio–Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks

Pham, Salmane, Khoudour et al. 2019, Sensors


Designing motion representations for 3D human action recognition from skeleton sequences is an important yet challenging task. An effective representation should be robust to noise, invariant to viewpoint changes and result in a good performance with low-computational demand. Two main challenges in this task include how to efficiently represent spatio–temporal patterns of skeletal movements and how to learn their discriminative features for classification tasks. This paper presents a novel skeleton-based representation and a deep learning framework for 3D action recognition using RGB-D sensors. We propose to build an action map called SPMF (Skeleton Posture-Motion Feature), which is a compact image representation built from skeleton poses and their motions. An Adaptive Histogram Equalization (AHE) algorithm is then applied on the SPMF to enhance their local patterns and form an enhanced action map, namely Enhanced-SPMF. For learning and classification tasks, we exploit Deep Convolutional Neural Networks based on the DenseNet architecture to learn directly an end-to-end mapping between input skeleton sequences and their action labels via the Enhanced-SPMFs. The proposed method is evaluated on four challenging benchmark datasets, including both individual actions, interactions, multiview and large-scale datasets. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches on all benchmark tasks, whilst requiring low computational time for training and inference.
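The idea of turning skeleton poses and their motions into a compact image can be illustrated with a short sketch. It is written under stated assumptions and is not the SPMF formulation from the paper: the min-max scaling, the use of frame-to-frame differences as "motion", and the function names `to_uint8_image` and `pose_motion_map` are all choices made for this example. The Adaptive Histogram Equalization step is omitted.

```python
import numpy as np


def to_uint8_image(values):
    """Min-max scale an array to integers in [0, 255] (assumed normalisation)."""
    lo, hi = values.min(), values.max()
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return np.round((values - lo) * scale).astype(np.uint8)


def pose_motion_map(skeleton):
    """Encode a (frames, joints, 3) sequence as a (joints, 2*frames - 1, 3) image.

    The (x, y, z) coordinates become the three colour channels. Poses come
    first along the time axis, followed by frame-to-frame differences.
    """
    poses = to_uint8_image(skeleton)
    motions = to_uint8_image(np.diff(skeleton, axis=0))
    stacked = np.concatenate([poses, motions], axis=0)
    # Put joints on the rows and time on the columns.
    return stacked.transpose(1, 0, 2)


# Toy input: 10 frames, 20 joints, random 3D coordinates.
rng = np.random.default_rng(0)
skeleton = rng.normal(size=(10, 20, 3))
image = pose_motion_map(skeleton)
```

The resulting array has the layout of a small colour image, so it can be passed to an image classifier such as a DenseNet after resizing.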


“…In the same regard, Ke et al [27] propose to transform a skeleton sequence into three video clips, the CNN characteristics of the three clips are then merged into a single characteristics vector, which is finally sent to a softmax function for classification. Pham et al [28] propose to use a residual network [29] with the transformed normalized skeleton in the RGB space as the input. Cao et al [30] propose to classify the image obtained thanks to gated convolutions.…”

Section: Convolutional Neural Network (CNN) · mentioning

confidence: 99%

“…We then classify these images using standard computer vision deep-learning methods, such as in [27][28][29][30][31], while preserving spatial and temporal relationships.…”

Section: Cartesian Coordinates Features Branch · mentioning

confidence: 99%

Predicting Intentions of Pedestrians from 2D Skeletal Pose Sequences with a Representation-Focused Multi-Branch Deep Learning Network

et al. 2020


Understanding the behaviors and intentions of humans is still one of the main challenges for vehicle autonomy. More specifically, inferring the intentions and actions of vulnerable actors, namely pedestrians, in complex situations such as urban traffic scenes remains a difficult task and a blocking point towards more automated vehicles. Answering the question “Is the pedestrian going to cross?” is a good starting point in order to advance in the quest to the fifth level of autonomous driving. In this paper, we address the problem of real-time discrete intention prediction of pedestrians in urban traffic environments by linking the dynamics of a pedestrian’s skeleton to an intention. Hence, we propose SPI-Net (Skeleton-based Pedestrian Intention network): a representation-focused multi-branch network combining features from 2D pedestrian body poses for the prediction of pedestrians’ discrete intentions. Experimental results show that SPI-Net achieved 94.4% accuracy in pedestrian crossing prediction on the JAAD data set while being efficient for real-time scenarios since SPI-Net can reach around one inference every 0.25 ms on one GPU (i.e., RTX 2080ti), or every 0.67 ms on one CPU (i.e., Intel Core i7 8700K).
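As a quick check of what the reported latencies imply, the per-inference times in the abstract above can be converted to throughput. The helper name `inferences_per_second` is introduced here for illustration only. The calculation assumes back-to-back inferences with no other overhead.

```python
def inferences_per_second(latency_ms):
    """Convert a per-inference latency in milliseconds to inferences per second."""
    return 1000.0 / latency_ms


gpu_rate = inferences_per_second(0.25)  # GPU latency quoted in the abstract
cpu_rate = inferences_per_second(0.67)  # CPU latency quoted in the abstract
```

Under that assumption, 0.25 ms per inference corresponds to 4000 inferences per second, and 0.67 ms to roughly 1490 per second.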


“…One of the major challenges in exploiting D-CNNs for skeleton-based action recognition is how a skeleton sequence could be effectively represented and fed to the deep networks. As D-CNNs work well on still images [18], our idea therefore is to encode the spatial and temporal dynamics of skeletons into 2D images [28,29]. Two essential elements for describing an action are static poses and their temporal dynamics.…”

Section: Enhanced Skeleton Pose-motion Feature · mentioning

confidence: 99%

A Deep Learning Approach for Real-Time 3D Human Action Recognition from Skeletal Data

Pham, Salmane, Khoudour et al. 2019, Lecture Notes in Computer Science


We present a new deep learning approach for real-time 3D human action recognition from skeletal data and apply it to develop a vision-based intelligent surveillance system. Given a skeleton sequence, we propose to encode skeleton poses and their motions into a single RGB image. An Adaptive Histogram Equalization (AHE) algorithm is then applied on the color images to enhance their local patterns and generate more discriminative features. For learning and classification tasks, we design Deep Neural Networks based on the Densely Connected Convolutional Architecture (DenseNet) to extract features from enhanced-color images and classify them into classes. Experimental results on two challenging datasets show that the proposed method reaches state-of-the-art accuracy, whilst requiring low computational time for training and inference. This paper also introduces CEMEST, a new RGB-D dataset depicting passenger behaviors in public transport. It consists of 203 untrimmed real-world surveillance videos of realistic normal and anomalous events. We achieve promising results on real conditions of this dataset with the support of data augmentation and transfer learning techniques. This enables the construction of real-world applications based on deep learning for enhancing monitoring and security in public transport.

