2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00713

Boundary-sensitive Pre-training for Temporal Localization in Videos

Abstract: Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events i…

Cited by 51 publications (22 citation statements)
References: 86 publications
“…Another limitation is the need for many human-labeled videos for training and the constraint of a pre-defined vocabulary of actions. Interesting future directions include pre-training for action localization [2,75], and learning from videos and text corpora [30,53] without human labels.…”
Section: Conclusion and Discussion
confidence: 99%
“…SNEAK [139] studies adversarial robustness of TSGV models by examining three facets of vulnerabilities, i.e., vision, language, and cross-modal interaction, from both attack and defense aspects. Xu et al [140] further investigate model pre-training for TSGV by constructing a large-scale synthesized dataset with annotations, and designing a novel boundary-sensitive pretext task. Cao et al [141] reformulate TSGV as a set prediction task, and propose a multimodal transformer model inherited from DETR [142].…”
Section: Other Supervised Methods
confidence: 99%
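To make the set-prediction reformulation mentioned above concrete, here is a minimal PyTorch sketch of a DETR-style decoding head for TSGV: a fixed set of learnable moment queries cross-attends to fused video-text features, and each query predicts a normalized (center, width) span plus a foreground score. The `MomentDETR` name, query count, and span parameterization are assumptions for illustration, not the implementation of Cao et al [141]; the Hungarian matching used for training is omitted.

```python
import torch
import torch.nn as nn

class MomentDETR(nn.Module):
    """Toy DETR-style set-prediction head for temporal grounding (a sketch,
    not the authors' model). Learnable moment queries cross-attend to fused
    video-text features; each query regresses a span and a confidence."""

    def __init__(self, d_model=256, num_queries=10, num_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.span_head = nn.Linear(d_model, 2)  # (center, width), sigmoid-normalized
        self.cls_head = nn.Linear(d_model, 2)   # foreground / background logits

    def forward(self, memory):
        # memory: (B, L, d_model) fused video and text token features,
        # assumed precomputed by any backbone
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)              # (B, Q, d_model)
        spans = self.span_head(h).sigmoid()      # (B, Q, 2) in [0, 1]
        logits = self.cls_head(h)                # (B, Q, 2)
        return spans, logits

model = MomentDETR()
spans, logits = model(torch.randn(2, 100, 256))  # 2 videos, 100 fused tokens
print(spans.shape, logits.shape)                 # (2, 10, 2) (2, 10, 2)
```

The appeal of this formulation is that it removes hand-crafted proposal generation and non-maximum suppression: the query set directly emits a small, fixed number of candidate moments.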
“…However, the visual and textual features remain separately generated by different pre-trained extractors. Xu et al [140] propose a pre-training strategy for TSGV by constructing a large-scale synthesized dataset with TSGV annotations. Inspired by ViT [167], Cao et al [141] develop a video cubic embedding module to extract 3D visual tokens and learn video content from scratch, without reliance on a pre-trained visual feature extractor.…”
Section: Effective Feature Extractor(s)
confidence: 99%
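A minimal sketch of what such a video cubic embedding could look like, assuming it generalizes ViT's 2D patch embedding to 3D with a strided Conv3d over raw pixels; the module name, cube size, and dimensions below are hypothetical, not the exact module of [141]:

```python
import torch
import torch.nn as nn

class CubicEmbedding(nn.Module):
    """ViT-style patch embedding lifted to 3D (a sketch): a strided Conv3d
    splits a clip into non-overlapping (t, h, w) cubes and projects each
    cube to one token, so no pre-trained feature extractor is needed."""

    def __init__(self, dim=256, cube=(2, 16, 16), in_ch=3):
        super().__init__()
        # kernel == stride => non-overlapping cubes, one token per cube
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=cube, stride=cube)

    def forward(self, clip):
        # clip: (B, C, T, H, W) raw pixels
        x = self.proj(clip)                   # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, T'*H'*W', dim) tokens

emb = CubicEmbedding()
tokens = emb(torch.randn(1, 3, 16, 224, 224))  # 16-frame 224x224 clip
print(tokens.shape)                            # (1, 1568, 256) = 8*14*14 tokens
```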
“…Compared to the retrieval tasks [64,45,21], which require only video-level predictions, localization tasks [15,51,77] are essentially different since they need dense clip-level or frame-level predictions, and thus pre-training for these tasks is more challenging. In the pure video domain, this gap has been noticed and several pre-training works [65,2,66,73] tailored for action localization have been proposed. BSP [65] synthesizes temporal boundaries using existing action recognition datasets and conducts boundary-type classification to generate localization-friendly features.…”
Section: Related Work
confidence: 99%
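The splice-and-classify idea behind BSP [65] can be illustrated with a toy sketch: concatenate feature sequences from two trimmed action-recognition clips at a random cut point, then classify the type of the resulting synthetic boundary from a local window around it. The two-way label scheme, window size, and function name below are assumptions for illustration; BSP's actual boundary-type taxonomy is richer.

```python
import random
import torch

def synthesize_boundary(clip_a, clip_b, same_class):
    """Splice two (T, D) feature sequences at a random cut to create an
    artificial temporal boundary; return the local window around the cut
    and a boundary-type label. Toy sketch, not BSP's exact recipe."""
    cut = random.randint(4, clip_a.size(0) - 4)       # keep context both sides
    spliced = torch.cat([clip_a[:cut], clip_b[cut:]], dim=0)
    window = spliced[cut - 4 : cut + 4]               # 8-step boundary context
    # Assumed two-way scheme: 0 = same-class splice, 1 = different-class splice
    label = 0 if same_class else 1
    return window, label

# A pre-training step would feed `window` to a temporal encoder and train a
# classifier on `label`, pushing the features to be boundary-sensitive.
a, b = torch.randn(32, 512), torch.randn(32, 512)
window, label = synthesize_boundary(a, b, same_class=False)
print(window.shape, label)                            # torch.Size([8, 512]) 1
```

Because the source clips come from trimmed datasets such as those used for action recognition, the boundary positions are known by construction, which is what lets this pretext task scale without human localization labels.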