2020
DOI: 10.48550/arxiv.2012.06567
Preprint

A Comprehensive Study of Deep Video Action Recognition

Yi Zhu,
Xinyu Li,
Chunhui Liu
et al.

Abstract: Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for vi…

Cited by 42 publications (73 citation statements)
References 214 publications
“…Learning about human actions from video. Substantial work in computer vision explores models for human action recognition in video [69,61,8,19], including analyzing hand-object interactions [55,2,60,7,24] and egocentric video understanding [34,21,37,31,13]. More closely related to our work, visual affordance models derived from video can detect likely places for actions to occur [38,18,40], such as where to grasp a frying pan, and object saliency models can identify human-useable objects in egocentric video [14,5,20].…”
Section: Related Work
confidence: 83%
“…We can consider the core execution pattern as the short-term action dynamics whereas the gradual changes to the object-of-interest or scene as the long-range ones. Ideally, we would like an HAR model to be able to access both information sources at the same time without any information loss; however, this is not feasible due to hardware limitations and model footprint (Zhu et al., 2020). As a solution, we propose to restrict the repetition sequence lengths by employing sequence summarization or temporal encoding and rank pooling methods, such as Dynamic Images (DIs) (Fernando et al., 2015; Bilen et al., 2017), Motion History Images (MHIs) (Ahad et al., 2012), or a deep encoder network (Wang et al., 2016, 2021).…”
Section: Highlighting Action-related Effects With Repetitiveness
confidence: 99%
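The statement above refers to rank-pooling summarization via Dynamic Images. As a hedged illustration only, here is a minimal NumPy sketch using the simplified linear coefficients alpha_t = 2t − T − 1 that are commonly used to approximate rank pooling; the function name, clip shape, and the use of the simplified weights are assumptions for illustration, not the exact formulation of the cited papers.

```python
import numpy as np

def dynamic_image(frames):
    """Summarize a clip into a single 'dynamic image' via approximate
    rank pooling (simplified linear coefficients; an illustrative sketch).

    frames: array of shape (T, H, W, C), e.g. float32 in [0, 1].
    Returns an (H, W, C) array; rescale for visualization as needed.
    """
    T = frames.shape[0]
    # Simplified approximate rank-pooling weights: alpha_t = 2t - T - 1
    # for t = 1..T, so later frames receive larger positive weights.
    t = np.arange(1, T + 1, dtype=np.float32)
    alpha = 2.0 * t - T - 1.0
    # Weighted sum over the temporal axis collapses the clip to one image.
    return np.tensordot(alpha, frames.astype(np.float32), axes=(0, 0))

# Example: 16 random frames of size 112x112x3 as a stand-in for a clip.
clip = np.random.rand(16, 112, 112, 3).astype(np.float32)
di = dynamic_image(clip)
print(di.shape)  # (112, 112, 3)
```

The appeal of this family of methods, as the statement notes, is that a long repetition sequence collapses to a fixed-size representation before it ever reaches the recognition model.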
“…Robust short-term modeling has been achieved in the last decades with elaborate hand-engineered, and more recently deep learning-based, feature descriptors. Long-term modeling is still an issue in the deep learning era, since the generation of robust representations through the hierarchical correlation of deep features does not scale well as the duration of an action increases (Zhu et al., 2020). This has an additional impact on the computational cost both for training and inference.…”
Section: Introduction
confidence: 99%
“…In a recent 2020 study, Zhu et al. [46] summarize the action recognition performance of 25 recent and historical models. They include an inference latency comparison of seven traditional models, including I3D [6], TSN [38], SlowFast [13], and R(2+1)D [35], on a single GPU, when using a batch size of one (single-instance inference).…”
Section: Related Work
confidence: 99%
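To make the single-instance inference setup concrete, below is a minimal PyTorch sketch of how such a latency measurement is commonly taken: batch size one, warm-up iterations, GPU synchronization, then an averaged timed loop. The r3d_18 backbone, clip shape, and iteration counts are illustrative assumptions, not the models or protocol of the cited study.

```python
import time
import torch
import torchvision

# Stand-in video backbone (a 3D ResNet from torchvision), batch size one.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.video.r3d_18(weights=None).to(device).eval()
clip = torch.randn(1, 3, 16, 112, 112, device=device)  # (N, C, T, H, W)

with torch.no_grad():
    for _ in range(10):                  # warm-up: stabilize caches/kernels
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()         # flush queued GPU work before timing
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()         # ensure all timed work has finished
    mean_ms = (time.perf_counter() - start) / runs * 1e3
    print(f"mean single-instance latency: {mean_ms:.1f} ms")
```

The synchronize calls matter because CUDA execution is asynchronous; without them, the wall-clock timer would only measure kernel launch overhead rather than end-to-end inference latency.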