3D CNNs on Distance Matrices for Human Action Recognition

Ruiz, Alejandro Hernandez; Porzi, Lorenzo; Bulò, Samuel Rota; Moreno-Noguer, Francesc

doi:10.1145/3123266.3123299

Cited by 29 publications

(11 citation statements)

References 32 publications


“…Sequence-based treats the 3D-skeleton data as a multi-dimensional time-series and models it with a recurrent architecture [21,22,32,35,46] to learn the temporal dynamics of the joints. Image-based create a pseudo-image representation of the 3D-skeleton data [7,12,17,23,38] which is encoded by CNN architectures to model the co-occurrence of multiple joints and their motion. Finally, graph-based [4,13,18,24,31,33,37,44] represents the 3D-skeleton data with a graph consisting of spatial and temporal edges.…”

Section: Related Work (mentioning)

confidence: 99%

“…robust to changes in background and appearance [23,46]. However, learning a good feature space for 3D actions requires large amounts of labeled skeleton data [7,12,35,36,44,45,46], which is much harder to obtain than large amounts of labeled RGB video. To address this major shortcoming, we propose a new self-supervised contrastive learning method for 3D skeleton data.…”

Section: Introduction (mentioning)

confidence: 99%


Skeleton-Contrastive 3D Action Representation Learning

Thoker, Doughty, Snoek, 2021

Proceedings of the 29th ACM International Conference on Multimedia

This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. Code is available at https://github.com/fmthoker/skeleton-contrast.

CCS Concepts: Computing methodologies → Activity recognition.


“…RNN-based methods [6,8,9,10] aim to capture the temporal dependency of skeleton data and have achieved remarkable performance than manually designed features. CNN-based models [11,12] are also proposed to extract spatial and temporal information by applying convolution in both 3D skeletons and sequences. Recently GCN-based models [13] have been favored for the fine-grained modeling of the spatial structure by using graph representation, and have achieved more impressive performance.…”

Section: Related Work (mentioning)

confidence: 99%

Hierarchical Transformer: Unsupervised Representation Learning for Skeleton-Based Human Action Recognition

Cheng, Chen, et al., 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)

The unsupervised representation learning for skeleton-based human action can be utilized in a variety of pose analysis applications. However, previous unsupervised methods focus on modeling the temporal dependencies in sequences, but take less effort in modeling the spatial structure in human action. To this end, we propose a novel unsupervised learning framework named Hierarchical Transformer for skeleton-based human action recognition. The Hierarchical Transformer consists of hierarchically aggregated self-attention modules for better capturing the spatial and temporal structure in the skeleton sequences. Furthermore, we propose to predict the motion between adjacent frames as a novel pre-training task for better capturing the long-term dependencies in sequences. Experimental results show that our method outperforms prior state-of-the-art unsupervised methods on NTU RGB+D and NW-UCLA datasets. Besides, our method also achieves state-of-the-art performance when the pre-trained model is transferred to SBU dataset, which demonstrates the generalizability of learned representation.


“…The challenge with CNN based methods is the extraction and utilization of spatial as well as temporal information from 3D skeleton sequences. Several other problems hinder these techniques including model size and speed [45], occlusions, CNN architecture definition [30], and viewpoint variations [47]. Skeleton based action recognition using CNNs thus remains a not completely solved research question.…”

Section: Action Recognition (mentioning)

confidence: 99%

Geometric Deep Neural Network using Rigid and Non-Rigid Transformations for Human Action Recognition

Friji, Drira, Chaieb, et al., 2021

2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Deep Learning architectures, albeit successful in most computer vision tasks, were designed for data with an underlying Euclidean structure, which is not usually fulfilled since pre-processed data may lie on a non-linear space. In this paper, we propose a geometry aware deep learning approach using rigid and non rigid transformation optimization for skeleton-based action recognition. Skeleton sequences are first modeled as trajectories on Kendall's shape space and then mapped to the linear tangent space. The resulting structured data are then fed to a deep learning architecture, which includes a layer that optimizes over rigid and non rigid transformations of the 3D skeletons, followed by a CNN-LSTM network. The assessment on two large scale skeleton datasets, namely NTU-RGB+D and NTU-RGB+D 120, has proven that the proposed approach outperforms existing geometric deep learning methods and exceeds recently published approaches with respect to the majority of configurations.
