Zero-shot learning for action recognition using synthesized features

Mishra, Ashish; Pandey, Anubha; Murthy, Hema A.

doi:10.1016/j.neucom.2020.01.078

Cited by 38 publications

(14 citation statements)

References 3 publications

“…The former assumes that only the labeled videos from the seen categories are available during training while the latter can use the unlabeled data of the unseen categories for model training. Specifically, in this work, we focus on inductive ZSAR [12], [15], [26], [42] and do not discuss the transductive approach [9], [32].…”

Section: Methods (mentioning)

confidence: 99%

Learning Using Privileged Information for Zero-Shot Action Recognition

Gao, Hou, Li et al. 2022

Preprint

“…Specifically, in the former line of research, only a few training samples are available from each action category, [82,83] proposed compound memory networks to classify videos by matching and ranking; [11] used GANs to synthesize training examples for novel categories; [6] proposed differentiable dynamic time warping to align videos of different lengths; [54] exploited CrossTransformer, to find temporally-corresponding frame tuples between the query and given few-shot videos. While in open-set action recognition, it requires the model to generalise towards action categories that are unseen in the training set, one typical idea lies in learning a common representation space that is shared by seen and unseen actions, such as attributes space [19,42], semantic space [20,36], synthesizing features to unseen actions [49], using objects to create common space for unseen actions [46].…”

Section: Related Work (mentioning)

confidence: 99%

Prompting Visual-Language Models for Efficient Video Understanding

Chen, Han, Kunhao et al. 2021

Preprint

Visual-language pre-training has shown great success for learning joint visual-textual representations from large-scale web data, demonstrating remarkable ability for "zero-shot" generalisation. This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training, and here, we consider video understanding tasks. Specifically, we propose to optimise a few random vectors, termed as "continuous prompt vectors", that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components and necessities. On 9 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, open-set scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite training significantly fewer parameters.

“…Li et al (2016) and Tian et al (2018) map features from videos to a semantic space shared by seen and unseen actions, while Gan et al (2016c) train a classifier for unseen actions by performing several levels of relatedness to seen actions. Other works propose to synthesize features for unseen actions (Mishra et al 2018, 2020), learn a universal representation of actions (Zhu et al 2018), or differentiate seen from unseen actions through out-of-distribution detection (Mandal et al 2019). All these works eliminate the need for attributes for unseen action classification.…”

Section: Unseen Action Classification (mentioning)

confidence: 99%

Object Priors for Classifying and Localizing Unseen Actions

2021

This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen action videos, we seek to classify and spatio-temporally localize unseen actions in videos from image-based object information only. We propose three spatial object priors, which encode local person and object detectors along with their spatial relations. On top we introduce three semantic object priors, which extend semantic matching through word embeddings with three simple functions that tackle semantic ambiguity, object discrimination, and object naming. A video embedding combines the spatial and semantic object priors. It enables us to introduce a new video retrieval task that retrieves action tubes in video collections based on user-specified objects, spatial relations, and object size. Experimental evaluation on five action datasets shows the importance of spatial and semantic object priors for unseen actions. We find that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization.
