2022
DOI: 10.3390/app12126215

Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition

Abstract: Incorporating multi-modality data is an effective way to improve action recognition performance. Based on this idea, we investigate a new data modality in which Whole-Body Keypoint and Skeleton (WKS) labels are used to capture refined body information. Unlike methods that directly aggregate multiple modalities, we leverage distillation to give an RGB network, fed only RGB clips, the feature-extraction ability of the WKS network. Inspired by the success of transformers for vision tasks, we …
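The abstract does not give the exact training objective, so the following is only a rough sketch of what a logit-plus-feature distillation loss between a WKS teacher and an RGB student might look like in PyTorch; the temperature, weightings, and function name are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      labels, temperature=4.0, alpha=0.5, beta=0.5):
    """Hypothetical combined loss: cross-entropy on ground-truth labels,
    KL divergence on temperature-softened logits, and L2 feature matching
    between the student (RGB) and teacher (WKS) networks."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # standard rescaling for soft targets
    feat = F.mse_loss(student_feat, teacher_feat)
    return ce + alpha * kl + beta * feat
```

In this pattern only the student receives gradients; the teacher's logits and features would be computed under torch.no_grad().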

Cited by 2 publications (3 citation statements). References 50 publications.
“…Video recognition [29][30][31] is an important direction in computer vision and video processing research. To date, many effective video recognition methods have been developed, which can be grouped into two categories: temporal-spatial-based and spatial-based video recognition methods.…”
Section: Discussion
confidence: 99%
“…Guo, Z. et al. (2022) [3] suggested a method for capturing detailed body information using Whole-Body Keypoint and Skeleton (WKS) labels. They developed an architecture that achieves superior performance by combining the Swin transformer with three-dimensional (3D) convolutional neural networks (CNNs) to extract spatiotemporal characteristics.…”
Section: Related Work
confidence: 99%
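The citation statement only names the components, not how they are wired together. A minimal sketch of the general hybrid pattern follows, using a vanilla nn.TransformerEncoder as a stand-in for the Swin transformer; all layer counts, dimensions, and the class name are assumptions.

```python
import torch
import torch.nn as nn

class Conv3DTransformer(nn.Module):
    """Illustrative hybrid: a small 3D-CNN stem extracts spatiotemporal
    feature maps, which are flattened into tokens for a transformer
    encoder (a stand-in for the Swin transformer named in the paper)."""
    def __init__(self, num_classes=400, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 4, 4), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, dim, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, T, H, W)
        f = self.stem(x)                       # (B, dim, T', H', W')
        tokens = f.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim)
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))   # average-pool the tokens
```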
“…The initial model structure was specifically designed for video data, incorporating multiple 3D convolutional layers to capture both spatial and temporal features effectively. The input shape here is (25, 224, 224, 3), which represents (the number of frames per sequence, image height, image width, and color channels). At a certain point in training, the model could correctly predict the entire training dataset.…”
Section: Proposed Model Evaluation 6.1 Experimental 3DCNN Model
confidence: 99%
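The cited statement gives only the input shape, not the layer configuration, so this is a minimal runnable sketch of a 3D CNN accepting 25-frame 224×224 RGB clips. The cited shape (25, 224, 224, 3) is channels-last; PyTorch expects channels-first, i.e. (batch, 3, 25, 224, 224). The class count and layer sizes are purely illustrative.

```python
import torch
import torch.nn as nn

# Minimal 3D-CNN classifier for 25-frame RGB clips (channels-first layout).
model = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space, keep all frames
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),           # pool space and time
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), # global spatiotemporal pooling
    nn.Linear(64, 10),                     # 10 classes, an assumption
)

clip = torch.randn(2, 3, 25, 224, 224)     # batch of 2 clips
print(model(clip).shape)                   # torch.Size([2, 10])
```

A network this small that nonetheless predicts the entire training set perfectly, as the citing authors report, would be a classic sign of overfitting, which is presumably why they evaluate further.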