2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00467
Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications

Abstract: Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D …
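To make the pipeline the abstract describes more concrete, here is a minimal PyTorch sketch of end-to-end zero-shot video classification: a trainable 3D CNN is projected into a word-embedding space, trained against the embeddings of seen class names, and unseen classes are recognized by nearest class-name embedding. All names here (ZeroShotVideoNet, EMBED_DIM, the cosine regression loss, the backbone interface) are illustrative assumptions, not the paper's exact architecture or objective.

```python
# Hedged sketch of end-to-end zero-shot video classification.
# Assumption: `backbone` is any trainable 3D CNN (e.g., an R(2+1)D variant)
# that maps a clip batch (B, C, T, H, W) to features (B, feat_dim).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 300  # assumption: Word2Vec-style 300-d class-name embeddings


class ZeroShotVideoNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone               # trainable 3D CNN
        self.head = nn.Linear(feat_dim, EMBED_DIM)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # (B, C, T, H, W) -> L2-normalized (B, EMBED_DIM)
        feats = self.backbone(clips)
        return F.normalize(self.head(feats), dim=-1)


def training_step(model, clips, label_embeddings, labels, optimizer):
    """One step: pull clip embeddings toward their class-name embeddings."""
    pred = model(clips)                                   # (B, EMBED_DIM)
    target = F.normalize(label_embeddings[labels], dim=-1)
    loss = (1.0 - F.cosine_similarity(pred, target)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def zero_shot_predict(model, clips, unseen_embeddings):
    """Classify clips into *unseen* classes by nearest class-name embedding."""
    pred = model(clips)                                   # (B, EMBED_DIM)
    sims = pred @ F.normalize(unseen_embeddings, dim=-1).T
    return sims.argmax(dim=-1)         # indices into the unseen-class list
```

Because the whole network (backbone included) receives gradients from the embedding loss, this is "end-to-end" in the sense the abstract uses, as opposed to regressing embeddings from frozen pre-extracted features.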

Cited by 101 publications (117 citation statements)
References: 44 publications

Citation statements:
“…The relations between visual features and semantic features. The work of zero-shot action recognition [4] successfully utilizes Word2Vec to encode the knowledge of semantic information from natural language. The difference between the task of zero-shot action recognition and that of few-shot action recognition is that zero-shot learning has no support from videos, while few-shot learning can rely on both videos and semantic embeddings to classify previously unseen categories.…”
Section: Semantic Space Projection (mentioning)
confidence: 99%
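To make the quoted Word2Vec point concrete, below is a small sketch of how action-class names can be encoded as semantic embeddings for zero-shot recognition. It uses gensim's pretrained-vector downloader; the specific model name and the word-averaging scheme are assumptions for illustration, not details taken from the cited work.

```python
# Hedged sketch: encode action-class names with Word2Vec by averaging the
# vectors of their constituent words (assumed scheme; details may differ).
import numpy as np
import gensim.downloader

kv = gensim.downloader.load("word2vec-google-news-300")  # 300-d vectors


def embed_class_name(name: str) -> np.ndarray:
    """Average the Word2Vec vectors of the words in a class name."""
    words = [w for w in name.lower().split() if w in kv]
    if not words:
        raise KeyError(f"no Word2Vec entry for any word in {name!r}")
    vec = np.mean([kv[w] for w in words], axis=0)
    return vec / np.linalg.norm(vec)


# Unseen classes are recognized by comparing a video's predicted embedding
# to these class-name embeddings; no videos of the unseen classes are
# needed, which is exactly what separates zero-shot from few-shot above.
class_embeddings = {c: embed_class_name(c) for c in ["archery", "playing guitar"]}
```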
“…Unseen action classification: In Table 9, we show the unseen classification accuracies on UCF101 for three common … Brattoli et al (2020). Each approach employs different prior knowledge, making a direct comparison difficult.…”
Section: Comparative Evaluation (mentioning)
confidence: 99%
“…The train and test columns denote the number of actions used for training and testing. Our approach is state-of-the-art in the unseen setting, where no training actions are used, and competitive to Zhu et al (2018) and Brattoli et al (2020), who require extensive training on ActivityNet and Kinetics respectively.…”
(mentioning)
confidence: 99%
“…Zou et al [60] propose a soft composition mechanism to investigate compositional recognition that humans can perform, which has been well studied in cognitive science but not well explored under the few-shot learning setting. Brattoli et al [2] conduct an in-depth analysis of end-to-end training and pre-trained backbones for zero-shot learning.…”
Section: Few-shot Learning (mentioning)
confidence: 99%