2020
DOI: 10.48550/arxiv.2012.06567
Preprint

A Comprehensive Study of Deep Video Action Recognition

Yi Zhu,
Xinyu Li,
Chunhui Liu
et al.

Abstract: Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for vi…

Cited by 42 publications (73 citation statements)
References 214 publications
“…Learning about human actions from video. Substantial work in computer vision explores models for human action recognition in video [69,61,8,19], including analyzing hand-object interactions [55,2,60,7,24] and egocentric video understanding [34,21,37,31,13]. More closely related to our work, visual affordance models derived from video can detect likely places for actions to occur [38,18,40], such as where to grasp a frying pan, and object saliency models can identify human-useable objects in egocentric video [14,5,20].…”
Section: Related Work
confidence: 83%
“…We can consider the core execution pattern as the short-term action dynamics whereas the gradual changes to the object-of-interest or scene as the long-range ones. Ideally, we would like an HAR model to be able to access both information sources at the same time without any information loss; however, this is not feasible due to hardware limitations and model footprint (Zhu et al., 2020). As a solution, we propose to restrict the repetition sequence lengths by employing sequence summarization or temporal encoding and rank pooling methods, such as Dynamic Images (DIs) (Fernando et al., 2015; Bilen et al., 2017), Motion History Images (MHIs) (Ahad et al., 2012), or a deep encoder network (Wang et al., 2016, 2021).…”
Section: Highlighting Action-related Effects With Repetitiveness
confidence: 99%
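The statement above refers to rank-pooling summarization via Dynamic Images. As a hedged illustration only, here is a minimal NumPy sketch using the simplified linear coefficients alpha_t = 2t − T − 1 that are commonly used to approximate rank pooling; the function name, clip shape, and the use of the simplified weights are assumptions for illustration, not the exact formulation of the cited papers.

```python
import numpy as np

def dynamic_image(frames):
    """Summarize a clip into a single 'dynamic image' via approximate
    rank pooling (simplified linear coefficients; an illustrative sketch).

    frames: array of shape (T, H, W, C), e.g. float32 in [0, 1].
    Returns an (H, W, C) array; rescale for visualization as needed.
    """
    T = frames.shape[0]
    # Simplified approximate rank-pooling weights: alpha_t = 2t - T - 1
    # for t = 1..T, so later frames receive larger positive weights.
    t = np.arange(1, T + 1, dtype=np.float32)
    alpha = 2.0 * t - T - 1.0
    # Weighted sum over the temporal axis collapses the clip to one image.
    return np.tensordot(alpha, frames.astype(np.float32), axes=(0, 0))

# Example: 16 random frames of size 112x112x3 as a stand-in for a clip.
clip = np.random.rand(16, 112, 112, 3).astype(np.float32)
di = dynamic_image(clip)
print(di.shape)  # (112, 112, 3)
```

The appeal of this family of methods, as the statement notes, is that a long repetition sequence collapses to a fixed-size representation before it ever reaches the recognition model.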
“…Robust short-term modeling has been achieved in the last decades with elaborate hand-engineered, and more recently deep learning-based, feature descriptors. Long-term modeling is still an issue in the deep learning era, since the generation of robust representations through the hierarchical correlation of deep features does not scale well as the duration of an action increases (Zhu et al., 2020). This has an additional impact on the computational cost both for training and inference.…”
Section: Introduction
confidence: 99%
“…In a recent 2020 study, Zhu et al. [46] summarize the action recognition performance of 25 recent and historical models. They include an inference latency comparison of seven traditional models, including I3D [6], TSN [38], SlowFast [13], and R(2+1)D [35], on a single GPU, when using a batch size of one (single-instance inference).…”
Section: Related Work
confidence: 99%
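To make the single-instance inference setup concrete, below is a minimal PyTorch sketch of how such a latency measurement is commonly taken: batch size one, warm-up iterations, GPU synchronization, then an averaged timed loop. The r3d_18 backbone, clip shape, and iteration counts are illustrative assumptions, not the models or protocol of the cited study.

```python
import time
import torch
import torchvision

# Stand-in video backbone (a 3D ResNet from torchvision), batch size one.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.video.r3d_18(weights=None).to(device).eval()
clip = torch.randn(1, 3, 16, 112, 112, device=device)  # (N, C, T, H, W)

with torch.no_grad():
    for _ in range(10):                  # warm-up: stabilize caches/kernels
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()         # flush queued GPU work before timing
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()         # ensure all timed work has finished
    mean_ms = (time.perf_counter() - start) / runs * 1e3
    print(f"mean single-instance latency: {mean_ms:.1f} ms")
```

The synchronize calls matter because CUDA execution is asynchronous; without them, the wall-clock timer would only measure kernel launch overhead rather than end-to-end inference latency.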