2021 IEEE International Conference on Multimedia and Expo (ICME)
DOI: 10.1109/icme51207.2021.9428459

Hierarchical Transformer: Unsupervised Representation Learning for Skeleton-Based Human Action Recognition

Abstract: Unsupervised representation learning for skeleton-based human action can be utilized in a variety of pose analysis applications. However, previous unsupervised methods focus on modeling the temporal dependencies in sequences but devote less effort to modeling the spatial structure of human actions. To this end, we propose a novel unsupervised learning framework, named Hierarchical Transformer, for skeleton-based human action recognition. The Hierarchical Transformer consists of hierarchically aggregated self-a…

Cited by 38 publications (23 citation statements)
References 20 publications (29 reference statements)
“…Cheng et al [247] presented a hierarchical Transformer for unsupervised skeleton-based HAR, along with a motion prediction pre-training task between adjacent frames to learn discriminative representations. Liu et al [248] proposed a Kernel Attention Adaptive Graph Transformer Network (KA-AGTN), which is mainly composed of a skeleton graph transformer block to effectively capture the varying degrees of higher-order dependencies among joints, a temporal kernel attention module, and an adaptive graph strategy.…”
Section: Transformer-based Methods
confidence: 99%
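The motion-prediction pretraining task that [247] attaches to the hierarchical Transformer — predicting how each joint moves between adjacent frames — can be sketched roughly as follows. This is a minimal NumPy illustration under assumed conventions (the `(T, J, 2)` sequence layout and the function names are illustrative, not the cited paper's code):

```python
import numpy as np

def motion_prediction_targets(seq):
    """Given a skeleton sequence of shape (T, J, 2) -- T frames,
    J joints, (x, y) coordinates -- build the motion-prediction
    pretraining pair: inputs are frames 0..T-2, and targets are
    the per-joint displacements to the next frame."""
    inputs = seq[:-1]               # (T-1, J, 2) input frames
    targets = seq[1:] - seq[:-1]    # displacement between adjacent frames
    return inputs, targets

def motion_prediction_loss(pred, targets):
    """Mean-squared error between predicted and true displacements."""
    return float(np.mean((pred - targets) ** 2))
```

Training a model to regress these displacements forces the learned representation to encode short-term joint dynamics without requiring any action labels.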
“…In the activity recognition area, Trear [41] proposes a transformer-based RGB-D egocentric activity recognition framework, adapting self-attention to model the temporal structure of different modalities. In addition, the action-transformer [42], motion-transformer [43], hierarchical-transformer [44], spatial-temporal transformer network [45], and STST [46] are designed for skeleton-based activity recognition, modeling temporal and spatial dependencies in skeleton sequences. MM-ViT [47] factorizes self-attention across the space, time, and modality dimensions, operating in the compressed video domain and exploiting various modalities.…”
Section: Related Work
confidence: 99%
“…In NLP, masked language modelling is a pretraining technique in which randomly masked tokens in the input are predicted. This approach has been explored for action recognition (Cheng et al 2021), where certain frames are masked and a regression task estimates the coordinates of the keypoints. In addition, a direction loss is proposed to classify the quadrant in which the motion vector lies.…”
Section: Masking-based Pretraining
confidence: 99%
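The masked-frame regression with a quadrant direction loss described in the excerpt above can be sketched as follows. This is a hedged NumPy illustration, not the paper's implementation: the 40% masking ratio comes from the excerpt, while the quadrant numbering convention and the function names are assumptions:

```python
import numpy as np

def quadrant_labels(motion):
    """Classify each 2-D motion vector (dx, dy) into one of four
    quadrants (assumed numbering):
    0: dx >= 0, dy >= 0;  1: dx < 0, dy >= 0;
    2: dx < 0,  dy < 0;   3: dx >= 0, dy < 0."""
    dx, dy = motion[..., 0], motion[..., 1]
    return np.where(dy >= 0,
                    np.where(dx >= 0, 0, 1),
                    np.where(dx < 0, 2, 3))

def mask_frames(seq, ratio=0.4, rng=None):
    """Randomly zero out `ratio` of the frames in a (T, J, 2)
    sequence; return the masked sequence and a boolean mask
    marking which frames were hidden."""
    rng = rng or np.random.default_rng(0)
    num_frames = seq.shape[0]
    idx = rng.choice(num_frames,
                     size=int(round(ratio * num_frames)),
                     replace=False)
    mask = np.zeros(num_frames, dtype=bool)
    mask[idx] = True
    masked = seq.copy()
    masked[mask] = 0.0              # hide the selected frames
    return masked, mask
```

During pretraining, the model would regress the keypoint coordinates of the masked frames while a classification head predicts `quadrant_labels` of the motion vectors, combining both objectives.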
“…We follow the same hyperparameter settings as described in Motion-Transformer (Cheng et al 2021), with random masking of 40% of the input frames. When using only the regression loss, we find that pretraining learns to reduce the loss, as shown in Figure 3.…”
[Table 5: Effectiveness of pretraining strategies as measured on ISLR accuracy on INCLUDE]
Section: Masking-based Pretraining
confidence: 99%