2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2019.00015

Where to Focus on for Human Action Recognition?

Abstract: In this paper, we present a new attention model for the recognition of human actions from RGB-D videos. We propose an attention mechanism based on 3D articulated pose, with the objective of focusing on the most relevant body parts involved in the action. For action classification, we propose a classification network composed of spatio-temporal subnetworks modeling the appearance of human body parts and an RNN attention subnetwork implementing our attention mechanism. Furthermore, we train our proposed network end-to-end…
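The abstract describes pose-driven soft attention over body-part appearance features. Below is a minimal PyTorch sketch of that idea; the class name PoseAttention, the joint/part counts, and all layer sizes are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class PoseAttention(nn.Module):
        """RNN attention subnetwork: a 3D pose sequence is mapped to
        soft weights over body parts (hypothetical shapes throughout)."""
        def __init__(self, num_joints=25, num_parts=5, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(input_size=num_joints * 3, hidden_size=hidden,
                               batch_first=True)
            self.fc = nn.Linear(hidden, num_parts)

        def forward(self, pose, part_feats):
            # pose:       (B, T, num_joints * 3) flattened 3D skeleton per frame
            # part_feats: (B, num_parts, D) appearance features, one per body part
            _, (h, _) = self.rnn(pose)                    # h: (1, B, hidden)
            attn = torch.softmax(self.fc(h[-1]), dim=-1)  # (B, num_parts)
            # Weighted sum of part features: one descriptor for the classifier.
            return (attn.unsqueeze(-1) * part_feats).sum(dim=1)  # (B, D)

Per the abstract, the key design choice is that the weights are driven by the 3D pose (which body parts move) rather than by the appearance stream itself.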

Cited by 32 publications (34 citation statements)
References 36 publications
“…The attention mechanism of non-local blocks [35] over convolutional feature maps is not view-invariant and thus performs worse than a simple I3D backbone for the Temporal Model under CV protocols. P-I3D [8], with 42M trainable parameters compared to simple I3D's 12M, outperforms the state-of-the-art results on the NTU (95% average over CS and CV) and NU-CLA (93.5%) datasets when used as the backbone of the Temporal Model. The Global Model with P-I3D as the base network has 80M trainable parameters and improves actions with similar motion, such as wearing glasses (+2.5%) and taking off glasses (+2.1%), over the Basic Model (P-I3D).…”
Section: Comparison with the State-of-the-Art
confidence: 96%
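The quoted comparison hinges on trainable-parameter counts (12M vs. 42M vs. 80M). In PyTorch such counts are typically computed as below; this is a generic snippet, not tied to the cited models' code.

    import torch.nn as nn

    def count_trainable(model: nn.Module) -> int:
        # Sum over parameters that receive gradients, i.e. trainable ones.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Toy example; real counts would be taken on the I3D / P-I3D models.
    print(count_trainable(nn.Linear(1024, 60)))  # 1024*60 + 60 = 61500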
“…However, this operation, which computes affinities between features, does not reach beyond the spatio-temporal cube and thus does not account for long-term temporal relations. For ADL recognition, Das et al. [8] proposed a spatial attention mechanism on the spatio-temporal features extracted from the I3D network. The spatial attention assigns soft weights to the human body parts pertinent to the action.…”
Section: Related Work
confidence: 99%
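The affinity operation this quote refers to is the non-local block [35]. A simplified sketch of its embedded-Gaussian form follows (single block, illustrative names); it makes explicit that the softmax affinities only relate positions inside one T×H×W clip, which is the limitation the quote points out.

    import torch
    import torch.nn as nn

    class NonLocalBlock(nn.Module):
        """Embedded-Gaussian non-local block (simplified sketch of [35])."""
        def __init__(self, channels, inner=None):
            super().__init__()
            inner = inner or channels // 2
            self.theta = nn.Conv3d(channels, inner, kernel_size=1)
            self.phi = nn.Conv3d(channels, inner, kernel_size=1)
            self.g = nn.Conv3d(channels, inner, kernel_size=1)
            self.out = nn.Conv3d(inner, channels, kernel_size=1)

        def forward(self, x):
            # x: (B, C, T, H, W), one spatio-temporal cube. Affinities are
            # computed only among this clip's T*H*W positions, which is why
            # the mechanism cannot capture longer-term temporal relations.
            B, C, T, H, W = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)  # (B, THW, C')
            k = self.phi(x).flatten(2)                    # (B, C', THW)
            v = self.g(x).flatten(2).transpose(1, 2)      # (B, THW, C')
            attn = torch.softmax(q @ k, dim=-1)           # pairwise affinities
            y = (attn @ v).transpose(1, 2).reshape(B, -1, T, H, W)
            return x + self.out(y)                        # residual connection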