Skeleton-Contrastive 3D Action Representation Learning

Thoker, Fida Mohammad; Doughty, Hazel; Snoek, Cees G. M.

doi:10.1145/3474085.3475307

Cited by 77 publications

(39 citation statements)

References 46 publications

“…3S means the ensemble results of joint, bone and motion data. The obvious performance improvement compared with the recent advanced unsupervised counterparts [14,33] has been obtained and demonstrates the effectiveness of CPM. In addition, CPM (3S) outperforms the supervised ST-GCN [39] on both NTU and PKU-MMD datasets.…”

Section: Results and Comparison (mentioning)

confidence: 73%

Contrastive Positive Mining for Unsupervised 3D Action Representation Learning

Zhang, Hou, Zhang et al. 2022

Preprint

Recent contrastive based 3D action representation learning has made great progress. However, the strict positive/negative constraint is yet to be relaxed and the use of non-self positive is yet to be explored. In this paper, a Contrastive Positive Mining (CPM) framework is proposed for unsupervised skeleton 3D action representation learning. The CPM identifies non-self positives in a contextual queue to boost learning. Specifically, the siamese encoders are adopted and trained to match the similarity distributions of the augmented instances in reference to all instances in the contextual queue. By identifying the non-self positive instances in the queue, a positive-enhanced learning strategy is proposed to leverage the knowledge of mined positives to boost the robustness of the learned latent space against intra-class and inter-class diversity. Experimental results have shown that the proposed CPM is effective and outperforms the existing state-of-the-art unsupervised methods on the challenging NTU and PKU-MMD datasets.
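The distribution-matching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two temperatures, and the use of cross-entropy as the matching objective are all assumptions made for illustration. Each augmented view is compared against every instance in a queue, the similarities are turned into a distribution, and one view's distribution is trained to match the other's. The most similar queue entry is treated as a mined non-self positive.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def softmax(scores, temperature):
    # Numerically stable softmax over temperature-scaled scores.
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(embedding, queue, temperature):
    # Cosine similarity of one embedding to every queue entry, as a distribution.
    q = l2_normalize(embedding)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(k))) for k in queue]
    return softmax(sims, temperature)

def distribution_matching_loss(view_a, view_b, queue, t_target=0.05, t_online=0.1):
    # Cross-entropy between the two views' similarity distributions.
    # The temperatures are hypothetical values, not taken from the paper.
    target = similarity_distribution(view_a, queue, t_target)
    online = similarity_distribution(view_b, queue, t_online)
    return -sum(p * math.log(q + 1e-12) for p, q in zip(target, online))

def mine_positive(embedding, queue):
    # Index of the queue entry most similar to the embedding (assumed mining rule).
    dist = similarity_distribution(embedding, queue, 1.0)
    return max(range(len(queue)), key=dist.__getitem__)
```

With a toy queue such as `[[1, 0], [0, 1], [-1, 0]]`, two views that point the same way give a much smaller loss than two views that point in different directions.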

Section: Results and Comparison (mentioning)

confidence: 73%

“…The results have shown the proposed CPM performs significantly better than the compared methods. Compared with MS²L [15] and ISC [33], CPM improves the performance by a large margin and shows its robustness when fewer labels are available for fine-tuning.…”

Section: Architectures (mentioning)

confidence: 99%

Contrastive Positive Mining for Unsupervised 3D Action Representation Learning

Zhang, Hou, Zhang et al. 2022

Preprint

“…
Method              X-Set(%)  X-Sub(%)
LongT GAN [43]      39.7      35.6
PCRP [45]           45.1      41.7
AS-CAL [46]         49.2      48.6
CRRL [48]           57.0      56.2
3s-CrossSCLR [47]   66.7      67.9
3s-AimCLR [49]      68.8      68.2
ISC [58]            67.1      67.9
BRL [57]            79.2      77.1
ConGT               80.5      78.6
…”

Section: Methods (mentioning)

confidence: 99%

Skeleton-Based Action Recognition Through Contrasting Two-Stream Spatial-Temporal Networks

Pang, Lyu 2023

IEEE Trans. Multimedia

For pursuing accurate skeleton-based action recognition, most prior methods use the strategy of combining Graph Convolution Networks (GCNs) with attention-based methods in a serial way. However, they regard the human skeleton as a complete graph, resulting in less variations between different actions (e.g., the connection between the elbow and head in action "clapping hands"). For this, we propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in a parallel way. The ConGT involves two parallel streams: Spatial-Temporal Graph Convolution stream (STG) and Spatial-Temporal Transformer stream (STT). The STG is designed to obtain action representations maintaining the natural topology structure of the human skeleton. The STT is devised to acquire action representations containing the global relationships among joints. Since the action representations produced from these two streams contain different characteristics, and each of them knows little information of the other, we introduce the contrastive learning paradigm to guide their output representations of the same sample to be as close as possible in a self-supervised manner. Through the contrastive learning, they can learn information from each other to enrich the action features by maximizing the mutual information between the two types of action representations. To further improve action recognition accuracy, we introduce the Cyclical Focal Loss (CFL) which can focus on confident training samples in early training epochs, with an increasing focus on hard samples during the middle epochs. We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.
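The cross-stream contrastive step described in this abstract can be sketched as follows. The abstract does not give the exact loss, so the InfoNCE form, the temperature value, and the name `cross_stream_info_nce` are assumptions. The sketch pulls the two streams' representations of the same sample together and pushes apart representations of different samples in the batch.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors (small epsilon avoids division by zero).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def cross_stream_info_nce(stream_a, stream_b, temperature=0.1):
    # stream_a[i] and stream_b[i] are assumed to be the two streams'
    # embeddings of the same sample i; other indices act as negatives.
    losses = []
    for i, anchor in enumerate(stream_a):
        logits = [cosine(anchor, other) / temperature for other in stream_b]
        m = max(logits)
        # Log-sum-exp computed stably.
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

When the two streams agree on every sample the loss is close to zero. When the pairing is shuffled the loss grows, which is the signal that drives the two representations of the same sample together.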

“…Yang et al (2021b) design a novel skeleton cloud colorization technique to learn skeleton representations. AS-CAL (Rao et al 2021) and SkeletonCLR (Li et al 2021) use momentum encoder for contrastive learning with single-stream skeleton sequence while CrosSCLR (Li et al 2021) proposes cross-stream knowledge mining strategy to improve the performance and ISC (Thoker, Doughty, and Snoek 2021) proposes inter-skeleton contrastive learning to learn from multiple different input skeleton representations. In order to learn more general features, MS²L (Lin et al 2020) introduces multiple self-supervised tasks to learn more general representations.…”

Section: Related Work (mentioning)

confidence: 99%

Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition

Guo, Liu, Chen et al. 2022

AAAI

In recent years, self-supervised representation learning for skeleton-based action recognition has been developed with the advance of contrastive learning methods. The existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, the extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns to improve the universality of the learned representations. Second, since directly using extreme augmentations may not be able to boost the performance due to the drastic changes in original identity, the Dual Distributional Divergence Minimization Loss (D3M Loss) is proposed to minimize the distribution divergence in a more gentle way. Third, the Nearest Neighbors Mining (NNM) is proposed to further expand positive samples to make the abundant information mining process more reasonable. Exhaustive experiments on NTU RGB+D 60, PKU-MMD, NTU RGB+D 120 datasets have verified that our AimCLR can significantly perform favorably against state-of-the-art methods under a variety of evaluation protocols with observed higher quality action representations. Our code is available at https://github.com/Levigty/AimCLR.
