2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00713

Boundary-sensitive Pre-training for Temporal Localization in Videos

Abstract: Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events i…

Cited by 51 publications (22 citation statements)
References: 86 publications
“…Another limitation is the need for many human-labeled videos for training and the constraint of a pre-defined vocabulary of actions. Interesting future directions include pre-training for action localization [2,75], and learning from videos and text corpora [30,53] without human labels.…”
Section: Conclusion and Discussion
confidence: 99%
“…SNEAK [139] studies adversarial robustness of TSGV models by examining three facets of vulnerabilities, i.e., vision, language, and cross-modal interaction, from both attack and defense aspects. Xu et al [140] further investigate model pre-training for TSGV by constructing a large-scale synthesized dataset with annotations, and designing a novel boundary-sensitive pretext task. Cao et al [141] reformulate TSGV as a set prediction task, and propose a multimodal transformer model inherited from DETR [142].…”
Section: Other Supervised Methods
confidence: 99%
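To make the set-prediction reformulation mentioned above concrete, here is a minimal PyTorch sketch of a DETR-style decoding head for TSGV: a fixed set of learnable moment queries cross-attends to fused video-text features, and each query predicts a normalized (center, width) span plus a foreground score. The `MomentDETR` name, query count, and span parameterization are assumptions for illustration, not the implementation of Cao et al [141]; the Hungarian matching used for training is omitted.

```python
import torch
import torch.nn as nn

class MomentDETR(nn.Module):
    """Toy DETR-style set-prediction head for temporal grounding (a sketch,
    not the authors' model). Learnable moment queries cross-attend to fused
    video-text features; each query regresses a span and a confidence."""

    def __init__(self, d_model=256, num_queries=10, num_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.span_head = nn.Linear(d_model, 2)  # (center, width), sigmoid-normalized
        self.cls_head = nn.Linear(d_model, 2)   # foreground / background logits

    def forward(self, memory):
        # memory: (B, L, d_model) fused video and text token features,
        # assumed precomputed by any backbone
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)              # (B, Q, d_model)
        spans = self.span_head(h).sigmoid()      # (B, Q, 2) in [0, 1]
        logits = self.cls_head(h)                # (B, Q, 2)
        return spans, logits

model = MomentDETR()
spans, logits = model(torch.randn(2, 100, 256))  # 2 videos, 100 fused tokens
print(spans.shape, logits.shape)                 # (2, 10, 2) (2, 10, 2)
```

The appeal of this formulation is that it removes hand-crafted proposal generation and non-maximum suppression: the query set directly emits a small, fixed number of candidate moments.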
“…However, the visual and textual features remain separately generated by different pre-trained extractors. Xu et al [140] propose a pre-training strategy for TSGV by constructing a large-scale synthesized dataset with TSGV annotations. Inspired by ViT [167], Cao et al [141] develop a video cubic embedding module to extract 3D visual tokens and learn video content from scratch, without reliance on a pre-trained visual feature extractor.…”
Section: Effective Feature Extractor(s)
confidence: 99%
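A minimal sketch of what such a video cubic embedding could look like, assuming it generalizes ViT's 2D patch embedding to 3D with a strided Conv3d over raw pixels; the module name, cube size, and dimensions below are hypothetical, not the exact module of [141]:

```python
import torch
import torch.nn as nn

class CubicEmbedding(nn.Module):
    """ViT-style patch embedding lifted to 3D (a sketch): a strided Conv3d
    splits a clip into non-overlapping (t, h, w) cubes and projects each
    cube to one token, so no pre-trained feature extractor is needed."""

    def __init__(self, dim=256, cube=(2, 16, 16), in_ch=3):
        super().__init__()
        # kernel == stride => non-overlapping cubes, one token per cube
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=cube, stride=cube)

    def forward(self, clip):
        # clip: (B, C, T, H, W) raw pixels
        x = self.proj(clip)                   # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, T'*H'*W', dim) tokens

emb = CubicEmbedding()
tokens = emb(torch.randn(1, 3, 16, 224, 224))  # 16-frame 224x224 clip
print(tokens.shape)                            # (1, 1568, 256) = 8*14*14 tokens
```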
“…Compared to the retrieval tasks [64,45,21], which require only video-level predictions, localization tasks [15,51,77] are essentially different since they need dense clip-level or frame-level predictions, and thus pre-training for these tasks is more challenging. In the pure video domain, this gap has been noticed and several pre-training works [65,2,66,73] tailored for action localization have been proposed. BSP [65] synthesizes temporal boundaries using existing action recognition datasets and conducts boundary-type classification to generate localization-friendly features.…”
Section: Related Work
confidence: 99%
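The splice-and-classify idea behind BSP [65] can be illustrated with a toy sketch: concatenate feature sequences from two trimmed action-recognition clips at a random cut point, then classify the type of the resulting synthetic boundary from a local window around it. The two-way label scheme, window size, and function name below are assumptions for illustration; BSP's actual boundary-type taxonomy is richer.

```python
import random
import torch

def synthesize_boundary(clip_a, clip_b, same_class):
    """Splice two (T, D) feature sequences at a random cut to create an
    artificial temporal boundary; return the local window around the cut
    and a boundary-type label. Toy sketch, not BSP's exact recipe."""
    cut = random.randint(4, clip_a.size(0) - 4)       # keep context both sides
    spliced = torch.cat([clip_a[:cut], clip_b[cut:]], dim=0)
    window = spliced[cut - 4 : cut + 4]               # 8-step boundary context
    # Assumed two-way scheme: 0 = same-class splice, 1 = different-class splice
    label = 0 if same_class else 1
    return window, label

# A pre-training step would feed `window` to a temporal encoder and train a
# classifier on `label`, pushing the features to be boundary-sensitive.
a, b = torch.randn(32, 512), torch.randn(32, 512)
window, label = synthesize_boundary(a, b, same_class=False)
print(window.shape, label)                            # torch.Size([8, 512]) 1
```

Because the source clips come from trimmed datasets such as those used for action recognition, the boundary positions are known by construction, which is what lets this pretext task scale without human localization labels.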