2019
DOI: 10.1177/1729881418825093

Hierarchical dynamic depth projected difference images–based action recognition in videos with convolutional neural networks

Abstract: Temporal information plays a significant role in video-based human action recognition. How to effectively extract the spatial-temporal characteristics of actions in videos has always been a challenging problem. Most existing methods acquire spatial and temporal cues in videos individually. In this article, we propose a new and effective representation for depth video sequences, called hierarchical dynamic depth projected difference images, that can aggregate the spatial and temporal information of actions simultaneously…

Cited by 15 publications (18 citation statements). References 43 publications (85 reference statements).
“…The total average accuracy after testing both models on the NTU RGB+D dataset is 75.26% (CS) and 75.45% (CV) for the stateless ConvLSTM network, and 80.43% (CS) and 79.91% (CV) for the stateful ConvLSTM network. This proves that, although it is rarely used in the literature, the stateful mode of the conventional LSTM can dramatically improve its performance on challenging datasets such as NTU RGB+D.…”

Accompanying comparison table (accuracy, %):

Method                                  CS     CV
Modality: 3D Skeleton
ST-LSTM + Trust Gate (2016) [39]        69.2   77.7
Clips + CNN + MTLN (2017) [19]          79.57  84.83
AGC-LSTM (2019) [42]                    89.2   95.0
Modality: Depth
Unsupervised ConvLSTM (2017) [31]       66.2   -
Dynamic images (HRP) (2018) [58]        87.08  84.22
HDDPDI (2019) [65]                      82.43  87.56
Multi-view dynamic images (2019) [66]   84     …

Section: Methods (mentioning; confidence: 99%)
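The stateless/stateful distinction in the quoted comparison can be made concrete with a short sketch. The snippet below is an illustrative assumption in tf.keras (Keras 2.x), not code from the cited work; NUM_CLASSES, the clip length, and the depth-map size are made-up values.

# Minimal sketch: stateless vs. stateful ConvLSTM2D classifiers (assumed shapes).
import tensorflow as tf

NUM_CLASSES = 60                   # NTU RGB+D has 60 action classes
FRAMES, H, W, C = 16, 64, 64, 1    # assumed clip length and depth-map size

# Stateless: the hidden state is reset automatically after every batch.
stateless = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(32, (3, 3), input_shape=(FRAMES, H, W, C)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Stateful: the hidden state persists across batches, so a long video can be
# fed as consecutive chunks; a fixed batch size is then required.
BATCH = 8
stateful = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(32, (3, 3), stateful=True,
                               batch_input_shape=(BATCH, FRAMES, H, W, C)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
# After all chunks of one set of videos are consumed, clear the carried state:
# stateful.reset_states()   # tf.keras (Keras 2.x) API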
“…Among the different DNN and depth-based approaches for HAR, many of them modify the input to generate depth motion maps [59] or dynamic images [62,60,58,66,65] so as to encode the spatio-temporal information of a complete video into a few images through color and texture patterns. Besides, convolutional neural networks (CNNs), successfully used in image processing tasks, can be extended to a third dimension [44,29] (3D CNNs) to handle the temporal dimension of videos.…”

Section: Introduction (mentioning; confidence: 99%)
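The dynamic-image representations cited above compress a whole clip into one image whose pixel pattern encodes temporal order. A minimal sketch of the widely used approximate rank pooling (in the spirit of Bilen et al.) follows; it is an assumption for illustration, not the exact pipeline of any cited paper.

# Minimal sketch: collapse a depth clip into a single "dynamic image".
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W) array; returns one (H, W) uint8 dynamic image."""
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    # Approximate rank-pooling coefficients: later frames get larger weights,
    # so the single output image encodes the temporal ordering of the clip.
    alpha = 2.0 * t - T - 1.0
    di = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Rescale to 8-bit so the result can be fed to an ordinary 2-D CNN.
    di = (di - di.min()) / (np.ptp(di) + 1e-8)
    return (255.0 * di).astype(np.uint8)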
“…Most algorithms for depth-based action recognition are based on the 3D positions of body joints, which can be determined, for instance, by the MS Kinect sensor [4]. However, as pointed out in a recent work [5], there are only a few papers devoted to depth-based human action recognition using convolutional neural networks (CNNs). One of the reasons is that, unlike RGB video-based activity analysis, 3D action recognition suffers from the lack of large-scale benchmark datasets.…”

Section: Introduction (mentioning; confidence: 99%)
“…To improve the recognition accuracy of skeleton movements, researchers use deep learning techniques to model the spatial-temporal nature of skeleton sequences [5,6]. Examples include the recurrent neural network (RNN) [7,8], the deep convolutional neural network (CNN) [9–12], attention mechanisms [13,14], and the graph convolutional network (GCN) [15–19]. In early work, RNN/LSTM models the skeleton sequence through its short-term and long-term temporal dynamics, while the CNN reshapes the skeleton data into an appropriately sized input (224 × 224) and learns the correlations.…”

Section: Introduction (mentioning; confidence: 99%)
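The 224 × 224 reshaping mentioned in the quote can be illustrated with a hypothetical joint-time encoding: joints as rows, frames as columns, and (x, y, z) coordinates as RGB channels. The exact layout differs across the cited CNN methods [9–12], so treat skeleton_to_image below as an assumed sketch rather than any paper's encoding.

# Minimal sketch: turn a skeleton sequence into a CNN-ready pseudo-image.
import numpy as np
from PIL import Image

def skeleton_to_image(seq: np.ndarray) -> np.ndarray:
    """seq: (T, J, 3) array of T frames x J joints x (x, y, z) coordinates."""
    # Normalize each coordinate channel to [0, 255] over the whole sequence.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    img = (255.0 * (seq - lo) / (hi - lo + 1e-8)).astype(np.uint8)
    # Joints become rows, frames become columns, (x, y, z) maps to (R, G, B).
    img = img.transpose(1, 0, 2)
    # Resize to the conventional CNN input resolution mentioned in the quote.
    return np.asarray(Image.fromarray(img).resize((224, 224)))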