2022
DOI: 10.1007/978-3-031-20062-5_39

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Abstract: Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support video…

Cited by 16 publications (24 citation statements)
References 66 publications
“…To avoid labor and increase flexibility, some studies [30,31,32] propose learnable prompt tuning at the textual stream, showing strong low-shot generalization. In the CV domain, some recent papers [94,95,20] introduce such randomly initialized prompt tuning to handle visual tasks, e.g., image understanding [96,41,79,45] and video understanding [17,51,56]. However, these studies ignore lexical ambiguity of category names, and cases that are not easy to describe in text.…”
Section: Related Work (mentioning)
confidence: 99%
“…Low-Shot Temporal Action Localization considers more realistic scenarios: generalize TAL towards action categories that are unseen (zero-shot) or with several support samples (few-shot). Existing methods [20,51,88,2] most rely on foundational models pre-trained on large-scale image-caption pairs for help. Typically, E-Prompt [20] is the first to construct wide baselines with popular prompt tuning [30,31] and vanilla temporal modeling.…”
Section: Related Work (mentioning)
confidence: 99%