Zero-Shot Temporal Action Detection via Vision-Language Prompting

Nag, Sauradip; Zhu, Xiatian; Song, Yi-Zhe; Xiang, Tao

doi:10.1007/978-3-031-20062-5_39

Cited by 16 publications

(24 citation statements)

References 66 publications

“…To avoid labor and increase flexibility, some studies [30,31,32] propose learnable prompt tuning at the textual stream, showing strong low-shot generalization. In the CV domain, some recent papers [94,95,20] introduce such randomly initialized prompt tuning to handle visual tasks, e.g., image understanding [96,41,79,45] and video understanding [17,51,56]. However, these studies ignore lexical ambiguity of category names, and cases that are not easy to describe in text.…”

Section: Related Work (mentioning)

confidence: 99%

“…Low-Shot Temporal Action Localization considers more realistic scenarios: generalize TAL towards action categories that are unseen (zero-shot) or with several support samples (few-shot). Existing methods [20,51,88,2] most rely on foundational models pre-trained on large-scale image-caption pairs for help. Typically, E-Prompt [20] is the first to construct wide baselines with popular prompt tuning [30,31] and vanilla temporal modeling.…”

Section: Related Work (mentioning)

confidence: 99%

“…Typically, E-Prompt [20] is the first to construct wide baselines with popular prompt tuning [30,31] and vanilla temporal modeling. STALE [51] explores the one-stage framework to further simplify usage. Although promising, all above methods meet two main challenges: (1) For category semantics, the definition may be vague, inaccurate, or incomplete.…”

Section: Related Work (mentioning)

confidence: 99%

“…Splits. Following literature [20,51], we adopt two types of splits for zero-shot scenarios. The 75:25 split: train on 75% base categories and test on 25% novel categories.…”

Section: Datasets and Metrics (mentioning)

confidence: 99%
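The category-level split described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' protocol: the function name `split_categories`, the `seed` parameter, and the placeholder label names are assumptions introduced here, and the source does not say how categories are assigned to each side.

```python
import random


def split_categories(categories, train_fraction=0.75, seed=0):
    """Partition category names into disjoint base (seen) and novel (unseen) sets."""
    shuffled = sorted(categories)          # sort first so the result ignores input order
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for a given seed
    n_base = round(len(shuffled) * train_fraction)
    return shuffled[:n_base], shuffled[n_base:]


# Hypothetical label names, for illustration only.
labels = [f"action_{i:02d}" for i in range(20)]
base, novel = split_categories(labels)
print(len(base), len(novel))
```

Training then uses only videos whose labels fall in `base`, and evaluation uses only those in `novel`, so no test category is seen during training.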

“…In the recent literature, another line of research [20,51] considers a more challenging problem, that requires the vision system to handle both seen and unseen categories, with low-shot (zero or only few) examples at inference time, this problem is often termed as open-vocabulary temporal action localization. To tackle the problem, existing studies [20,51,42] take inspiration from large-scale foundational models [59,16,80], casting the problem of action classification in the form of cross-modal retrieval, i.e., for one action videos, searching its closest category embedding in text form (e.g., "an action video of class"). However, such a design potentially suffers from the lexical ambiguities, as multiple actions may share category names, despite its differing visual appearance.…”

Section: Introduction (mentioning)

confidence: 99%
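The retrieval-style classification mentioned in the statement above can be illustrated with a small sketch. It assumes cosine similarity as the closeness measure and uses hand-written toy vectors in place of real encoder outputs; the function names and category names are hypothetical and are not taken from any of the cited methods.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify(video_embedding, text_embeddings):
    """Return the category whose text embedding is most similar to the video embedding."""
    return max(text_embeddings, key=lambda name: cosine(video_embedding, text_embeddings[name]))


# Toy 3-d vectors standing in for text-encoder outputs (assumed values).
text_embeddings = {
    "high jump": [0.9, 0.1, 0.0],
    "long jump": [0.7, 0.7, 0.0],
    "diving": [0.0, 0.2, 0.9],
}
video_embedding = [0.8, 0.2, 0.1]
print(classify(video_embedding, text_embeddings))
```

Because the category set is just a dictionary of text embeddings, unseen categories can be added at inference time without retraining, which is the property the open-vocabulary setting relies on.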

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization

Chen, Zhao, Chen, et al. 2023. IEEE Trans. Multimedia

In this paper, we consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying the action instances from arbitrary categories within some untrimmed videos, even not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models with temporal motions, we improve category-agnostic action proposal by explicitly aligning embeddings of optical flows, RGB and texts, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., avoid lexical ambiguities. To be specific, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models), or visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, outperforming existing state-of-the-art approaches by one significant margin.
