2022
DOI: 10.1609/aaai.v36i1.19957
Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition

Abstract: In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. Existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervis…

Cited by 60 publications (53 citation statements); References 30 publications.
“…The SoTA methods used in the comparative experiment were (1) the base SGN [1], (2) ST-GCN [4], (3) a shift graph convolutional network (Shift-GCN) [12], (4) an Info-GCN for representation learning for the human skeleton [25], (5) disentangling and unifying graph convolutions (MS-G3D) [26], (6) a decoupled spatial-temporal attention network (DSTA-Net) [27], (7) 3S-AimCLR based on contrastive learning from extremely augmented skeleton sequences [28], (8) rich activated GCN (RA-GCNv2) [29], and (9) the proposed SGN-SHA.…”
Section: Results
confidence: 99%
“…In the field of skeleton-based action recognition, prior works (Li et al., 2021; Mao et al., 2022; Guo et al., 2022) proposed to apply contrastive learning in the pre-training stage by roughly following the frameworks mentioned above. CrossCLR (Li et al., 2021) mined positive pairs in the data space and explored the cross-modal distribution relationships.…”
Section: Contrastive Learning
confidence: 99%
“…Further, CMD (Mao et al., 2022) transferred cross-modal knowledge in a distillation manner, and AimCLR (Guo et al., 2022) used extreme augmentations to improve representation universality.…”
Section: Contrastive Learning
confidence: 99%
“…Wang et al. [48] proposed the contrast-reconstruction representation learning network to capture postures and motion dynamics simultaneously. In [49], Guo et al. utilized an abundant information mining strategy to make better use of the movement patterns. In [50], [51], it is suggested that contrasting congruent and incongruent views of graphs with mutual information maximization can help encode rich representations.…”
Section: Self-supervised Learning
confidence: 99%