2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00467
Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications

Abstract: Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D …
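To make the pipeline the abstract describes more concrete, here is a minimal PyTorch sketch of end-to-end zero-shot video classification: a trainable 3D CNN is projected into a word-embedding space, trained against the embeddings of seen class names, and unseen classes are recognized by nearest class-name embedding. All names here (ZeroShotVideoNet, EMBED_DIM, the cosine regression loss, the backbone interface) are illustrative assumptions, not the paper's exact architecture or objective.

```python
# Hedged sketch of end-to-end zero-shot video classification.
# Assumption: `backbone` is any trainable 3D CNN (e.g., an R(2+1)D variant)
# that maps a clip batch (B, C, T, H, W) to features (B, feat_dim).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 300  # assumption: Word2Vec-style 300-d class-name embeddings


class ZeroShotVideoNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone               # trainable 3D CNN
        self.head = nn.Linear(feat_dim, EMBED_DIM)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # (B, C, T, H, W) -> L2-normalized (B, EMBED_DIM)
        feats = self.backbone(clips)
        return F.normalize(self.head(feats), dim=-1)


def training_step(model, clips, label_embeddings, labels, optimizer):
    """One step: pull clip embeddings toward their class-name embeddings."""
    pred = model(clips)                                   # (B, EMBED_DIM)
    target = F.normalize(label_embeddings[labels], dim=-1)
    loss = (1.0 - F.cosine_similarity(pred, target)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def zero_shot_predict(model, clips, unseen_embeddings):
    """Classify clips into *unseen* classes by nearest class-name embedding."""
    pred = model(clips)                                   # (B, EMBED_DIM)
    sims = pred @ F.normalize(unseen_embeddings, dim=-1).T
    return sims.argmax(dim=-1)         # indices into the unseen-class list
```

Because the whole network (backbone included) receives gradients from the embedding loss, this is "end-to-end" in the sense the abstract uses, as opposed to regressing embeddings from frozen pre-extracted features.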

Cited by 101 publications (117 citation statements)
References: 44 publications

Citation statements:
“…The relations between visual features and semantic features. The work of zero-shot action recognition [4] successfully utilizes Word2Vec to encode the knowledge of semantic information from natural language. The difference between the task of zero-shot action recognition and that of few-shot action recognition is that zero-shot learning has no support from videos, while few-shot learning can rely on both videos and semantic embeddings to classify previously unseen categories.…”
Section: Semantic Space Projection (mentioning)
confidence: 99%
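To make the quoted Word2Vec point concrete, below is a small sketch of how action-class names can be encoded as semantic embeddings for zero-shot recognition. It uses gensim's pretrained-vector downloader; the specific model name and the word-averaging scheme are assumptions for illustration, not details taken from the cited work.

```python
# Hedged sketch: encode action-class names with Word2Vec by averaging the
# vectors of their constituent words (assumed scheme; details may differ).
import numpy as np
import gensim.downloader

kv = gensim.downloader.load("word2vec-google-news-300")  # 300-d vectors


def embed_class_name(name: str) -> np.ndarray:
    """Average the Word2Vec vectors of the words in a class name."""
    words = [w for w in name.lower().split() if w in kv]
    if not words:
        raise KeyError(f"no Word2Vec entry for any word in {name!r}")
    vec = np.mean([kv[w] for w in words], axis=0)
    return vec / np.linalg.norm(vec)


# Unseen classes are recognized by comparing a video's predicted embedding
# to these class-name embeddings; no videos of the unseen classes are
# needed, which is exactly what separates zero-shot from few-shot above.
class_embeddings = {c: embed_class_name(c) for c in ["archery", "playing guitar"]}
```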
“…Unseen action classification: In Table 9, we show the unseen classification accuracies on UCF101 for three common … Brattoli et al (2020). Each approach employs different prior knowledge, making a direct comparison difficult.…”
Section: Comparative Evaluation (mentioning)
confidence: 99%
“…The train and test columns denote the number of actions used for training and testing. Our approach is state-of-the-art in the unseen setting, where no training actions are used, and competitive to Zhu et al (2018) and Brattoli et al (2020), who require extensive training on ActivityNet and Kinetics respectively.…”
(mentioning)
confidence: 99%
“…Zou et al [60] propose a soft composition mechanism to investigate compositional recognition that humans can perform, which has been well studied in cognitive science but not well explored under the few-shot learning setting. Brattoli et al [2] conduct an in-depth analysis of end-to-end training and pre-trained backbones for zero-shot learning.…”
Section: Few-shot Learning (mentioning)
confidence: 99%