2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2019.00015

Where to Focus on for Human Action Recognition?

Abstract: In this paper, we present a new attention model for the recognition of human actions from RGB-D videos. We propose an attention mechanism based on 3D articulated pose, with the objective of focusing on the most relevant body parts involved in the action. For action classification, we propose a classification network composed of spatio-temporal subnetworks modeling the appearance of human body parts and an RNN attention subnetwork implementing our attention mechanism. Furthermore, we train our proposed network end-to-end…
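The abstract describes pose-driven soft attention over body-part appearance features. Below is a minimal PyTorch sketch of that idea; the class name PoseAttention, the joint/part counts, and all layer sizes are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class PoseAttention(nn.Module):
        """RNN attention subnetwork: a 3D pose sequence is mapped to
        soft weights over body parts (hypothetical shapes throughout)."""
        def __init__(self, num_joints=25, num_parts=5, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(input_size=num_joints * 3, hidden_size=hidden,
                               batch_first=True)
            self.fc = nn.Linear(hidden, num_parts)

        def forward(self, pose, part_feats):
            # pose:       (B, T, num_joints * 3) flattened 3D skeleton per frame
            # part_feats: (B, num_parts, D) appearance features, one per body part
            _, (h, _) = self.rnn(pose)                    # h: (1, B, hidden)
            attn = torch.softmax(self.fc(h[-1]), dim=-1)  # (B, num_parts)
            # Weighted sum of part features: one descriptor for the classifier.
            return (attn.unsqueeze(-1) * part_feats).sum(dim=1)  # (B, D)

Per the abstract, the key design choice is that the weights are driven by the 3D pose (which body parts move) rather than by the appearance stream itself.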

Cited by 32 publications (34 citation statements)
References 36 publications
“…The attention mechanism of non-local blocks [35] over convolutional feature maps is not view-invariant and thus performs worse than a simple I3D backbone for the Temporal Model under CV protocols. P-I3D [8], with 42M trainable parameters compared to simple I3D's 12M, outperforms the state-of-the-art results on the NTU (95% average over CS and CV) and NU-CLA (93.5%) datasets when used as the backbone of the Temporal Model. The Global Model with P-I3D as the base network has 80M trainable parameters and improves actions with similar motion, such as wearing glasses (+2.5%) and taking off glasses (+2.1%), over the Basic Model (P-I3D).…”
Section: Comparison with the State-of-the-Art
confidence: 96%
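The quoted comparison hinges on trainable-parameter counts (12M vs. 42M vs. 80M). In PyTorch such counts are typically computed as below; this is a generic snippet, not tied to the cited models' code.

    import torch.nn as nn

    def count_trainable(model: nn.Module) -> int:
        # Sum over parameters that receive gradients, i.e. trainable ones.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Toy example; real counts would be taken on the I3D / P-I3D models.
    print(count_trainable(nn.Linear(1024, 60)))  # 1024*60 + 60 = 61500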
“…However, this operation, which computes affinities between features, does not reach beyond the spatio-temporal cube and thus does not account for long-term temporal relations. For ADL recognition, Das et al. [8] proposed a spatial attention mechanism on the spatio-temporal features extracted from the I3D network. The spatial attention assigns soft weights to the human body parts pertinent to the action.…”
Section: Related Work
confidence: 99%
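The affinity operation this quote refers to is the non-local block [35]. A simplified sketch of its embedded-Gaussian form follows (single block, illustrative names); it makes explicit that the softmax affinities only relate positions inside one T×H×W clip, which is the limitation the quote points out.

    import torch
    import torch.nn as nn

    class NonLocalBlock(nn.Module):
        """Embedded-Gaussian non-local block (simplified sketch of [35])."""
        def __init__(self, channels, inner=None):
            super().__init__()
            inner = inner or channels // 2
            self.theta = nn.Conv3d(channels, inner, kernel_size=1)
            self.phi = nn.Conv3d(channels, inner, kernel_size=1)
            self.g = nn.Conv3d(channels, inner, kernel_size=1)
            self.out = nn.Conv3d(inner, channels, kernel_size=1)

        def forward(self, x):
            # x: (B, C, T, H, W), one spatio-temporal cube. Affinities are
            # computed only among this clip's T*H*W positions, which is why
            # the mechanism cannot capture longer-term temporal relations.
            B, C, T, H, W = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)  # (B, THW, C')
            k = self.phi(x).flatten(2)                    # (B, C', THW)
            v = self.g(x).flatten(2).transpose(1, 2)      # (B, THW, C')
            attn = torch.softmax(q @ k, dim=-1)           # pairwise affinities
            y = (attn @ v).transpose(1, 2).reshape(B, -1, T, H, W)
            return x + self.out(y)                        # residual connection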