2022
DOI: 10.1007/978-3-031-20062-5_42

CMD: Self-supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation

Cited by 17 publications (30 citation statements) · References 47 publications
“…In the field of skeleton-based action recognition, prior works (Li et al, 2021;Mao et al, 2022;Guo et al, 2022) proposed to apply contrastive learning in the pre-training stage by roughly following the frameworks mentioned above. CrossCLR (Li et al, 2021) mined positive pairs in the data space and explored the cross-modal distribution relationships.…”
Section: Contrastive Learning (mentioning), confidence: 99%
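These contrastive pre-training frameworks typically optimize an InfoNCE-style objective that pulls two augmented views of the same skeleton sequence together and pushes all other samples apart. A minimal sketch of such a loss in PyTorch, assuming paired query/key embeddings from the two views (the memory-bank negatives used by MoCo-style variants are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_query, z_key, temperature=0.07):
    """InfoNCE over a batch: matched (query, key) rows are positives,
    every other key in the batch serves as a negative."""
    z_query = F.normalize(z_query, dim=1)       # (N, D) embeddings of view 1
    z_key = F.normalize(z_key, dim=1)           # (N, D) embeddings of view 2
    logits = z_query @ z_key.t() / temperature  # (N, N) scaled cosine similarities
    labels = torch.arange(z_query.size(0), device=z_query.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are the positives
```

In MoCo-style setups, z_query would come from the online encoder and z_key from a momentum-updated copy of it.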
“…CrossCLR (Li et al, 2021) mined positive pairs in the data space and explored the cross-modal distribution relationships. Further, CMD (Mao et al, 2022) transferred the cross-modal knowledge in a distillation manner. And AimCLR (Guo et al, 2022) used extreme augmentations to improve the representation universality.…”
Section: Contrastive Learning (mentioning), confidence: 99%
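The "distillation manner" mentioned here can be read as each modality's similarity distribution over a set of reference embeddings acting as a soft target for the other modality. A rough sketch under that reading, assuming normalized embeddings and key banks for a joint branch and a motion branch (function names, temperatures, and the exact KL formulation are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(z_joint, z_motion, keys_joint, keys_motion,
                             t_student=0.1, t_teacher=0.05):
    """Cross-modal mutual distillation sketch: the similarity distribution
    one modality induces over its key bank supervises the other modality."""
    def sim_dist(z, keys, t):
        # softmax over similarities to a bank of K reference embeddings
        return F.softmax(F.normalize(z, dim=1) @ F.normalize(keys, dim=1).t() / t, dim=1)

    p_joint_teacher = sim_dist(z_joint, keys_joint, t_teacher).detach()
    p_motion_teacher = sim_dist(z_motion, keys_motion, t_teacher).detach()
    log_p_joint = torch.log(sim_dist(z_joint, keys_joint, t_student) + 1e-8)
    log_p_motion = torch.log(sim_dist(z_motion, keys_motion, t_student) + 1e-8)

    # each branch is taught by the other branch's (detached) distribution
    loss_joint = F.kl_div(log_p_joint, p_motion_teacher, reduction='batchmean')
    loss_motion = F.kl_div(log_p_motion, p_joint_teacher, reduction='batchmean')
    return loss_joint + loss_motion
```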
“…Following the previous related works [27,44], BiGRU is adopted as the encoder for a fair comparison. All sequences are resized to a fixed length of 64 frames via temporal crop-resize [44].…”
Section: Implementation Details and Evaluation (mentioning), confidence: 99%
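Temporal crop-resize, as referenced above, takes a random contiguous crop of the skeleton sequence and interpolates it to a fixed length of 64 frames. A minimal sketch assuming a (T, V, C) NumPy array of frames, joints, and channels (the crop-ratio range is an assumption):

```python
import numpy as np
import torch
import torch.nn.functional as F

def temporal_crop_resize(seq, out_len=64, min_ratio=0.5):
    """Randomly crop a contiguous temporal segment of seq (T, V, C),
    then linearly interpolate it to out_len frames."""
    T = seq.shape[0]
    crop_len = np.random.randint(max(1, int(T * min_ratio)), T + 1)
    start = np.random.randint(0, T - crop_len + 1)
    crop = seq[start:start + crop_len]                    # (crop_len, V, C)

    x = torch.from_numpy(crop).float()
    x = x.reshape(crop_len, -1).t().unsqueeze(0)          # (1, V*C, crop_len)
    x = F.interpolate(x, size=out_len, mode='linear', align_corners=False)
    return x.squeeze(0).t().reshape(out_len, *seq.shape[1:]).numpy()
```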
“…The query encoder first completes the pre-training contrastive task on all unlabeled data. Then, the pre-trained encoder and linear classifier are fine-tuned on randomly sampled 1% and 10% labeled data.

Method               Modality             Accuracy (%)
(method truncated)   Joint+Motion+Bone    77.8
3s-AimCLR [9]        Joint+Motion+Bone    78.9
3s-HiCLR [55]        Joint+Motion+Bone    80.4
3s-CrosSCLR-B [19]   Joint+Motion+Bone    82.1
3s-CPM [54]          Joint+Motion+Bone    83.2
3s-HiCo [7]          Joint+Motion+Bone    83.8
3s-CMD [27]          Joint+Motion+Bone    84.1
3s-A²MC              Joint+Motion+Bone    84.6…”
Section: Implementation Details and Evaluation (mentioning), confidence: 99%
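The semi-supervised protocol summarized above fine-tunes the pre-trained encoder jointly with a linear classifier on the small labeled subset. A schematic of that loop, assuming a hypothetical pretrained_encoder module emitting feat_dim-dimensional features and a DataLoader over the 1% or 10% labeled split (optimizer choice and hyperparameters are placeholders, not the reported settings):

```python
import torch
import torch.nn as nn

def finetune_on_labeled_subset(pretrained_encoder, labeled_loader,
                               feat_dim=1024, num_classes=60,
                               epochs=50, lr=1e-3, device='cuda'):
    """Fine-tune encoder + linear classifier on a small labeled subset
    (e.g. the randomly sampled 1% or 10% split)."""
    classifier = nn.Linear(feat_dim, num_classes).to(device)
    pretrained_encoder.to(device).train()
    params = list(pretrained_encoder.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for skeletons, labels in labeled_loader:
            skeletons, labels = skeletons.to(device), labels.to(device)
            loss = criterion(classifier(pretrained_encoder(skeletons)), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return pretrained_encoder, classifier
```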