2021
DOI: 10.48550/arxiv.2112.01194
Preprint

Video-Text Pre-training with Learned Regions

Abstract: Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate at frame-level directly and thus overlook the spatiotemporal structure of objects in video, which yet has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for v…

Cited by 3 publications (4 citation statements). References 69 publications (85 reference statements).
“…The modules that can be efficiently integrated into main frameworks are referred to as "plug-and-play" approaches. In recent years, plug-and-play designs have attracted growing attention in various fields, including image restoration [45]-[48], visual captioning [49], [50], visual question answering [51], [52], and video-text matching [53], [54]. By decoupling a specific problem from the overall optimization objective, they greatly simplify the integration of each module and improve flexibility and generalizability on new frameworks, thereby accelerating development in other, more sophisticated applications.…”
Section: Plug-and-play Methods
Citation type: mentioning (confidence: 99%)
“…VideoBERT (Sun et al. 2019b), ActBert (Zhu and Yang 2020), DECEMBERT (Tang, Lei, and Bansal 2021), and VIOLET (Fu et al. 2021) pre-train matching tasks using the special token [CLS] for binary classification (Ruan and Jin 2022) with a cross-modal encoder (Vaswani et al. 2017). Some methods (Zellers et al. 2021; Ge et al. 2022; Miech et al. 2019, 2020; Ging et al. 2020; Wang et al. 2022b; Yang, Bisk, and Gao 2021; Yan et al. 2021; Luo et al. 2021; Patrick et al. 2020; Cai et al. 2022; Li et al. 2020; Xu et al. 2021b; Cao et al. 2022) pre-train matching tasks with two-stream encoders by pulling paired samples closer while pushing unpaired ones away (Ruan and Jin 2022). Others (Luo et al. 2020; Li et al. 2022) combine cross-modal Transformer matching tasks with two-stream encoder matching tasks for stronger learning ability.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
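The two-stream matching objective described in the excerpt above, pulling paired video-text samples together while pushing unpaired ones apart, is typically implemented as a symmetric contrastive loss. The following is a minimal PyTorch sketch of that idea, not code from the cited papers; the function name, temperature value, and embedding dimensions are assumptions for illustration only.

```python
# Hedged sketch: a symmetric InfoNCE-style video-text contrastive loss,
# illustrating the two-stream matching objective (paired samples pulled
# closer, unpaired samples pushed apart). Names and values are illustrative.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same video-text pair."""
    # L2-normalize so the dot product becomes a cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = v @ t.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example usage with random embeddings:
# loss = video_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```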
“…Encoders. Following [4,54,67], we adopted ViT-B/16 [13] with space-time attention [5] as the video encoder. The spatial attention weights in the transformer were initialized with ImageNet-21k pre-trained weights, while the temporal attention weights were set to zero.…”
Section: Settings of Pre-training
Citation type: mentioning (confidence: 99%)
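The zero-initialization of temporal attention mentioned in this excerpt could be sketched as below; this is an illustrative assumption of how such an initialization might look in PyTorch, not the cited authors' actual code, and the "temporal_attn" naming convention is hypothetical.

```python
# Hedged sketch: zero-initialize the temporal-attention branch of a divided
# space-time attention block so the model initially behaves like the
# spatially pre-trained image ViT. Module/field names are hypothetical.
import torch.nn as nn

def zero_init_temporal_attention(model: nn.Module) -> None:
    """Zero the weights of every linear layer whose module name marks it as
    part of the temporal-attention branch (naming convention assumed)."""
    for name, module in model.named_modules():
        if "temporal_attn" in name and isinstance(module, nn.Linear):
            nn.init.zeros_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```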
“…Video-Text Pre-training (VTP) [49,39,31,30,26,4,67,54] has attracted increasing attention with the aim of learning generic and transferable joint video-language (VL) representations. Compared to conventional separate pre-training on each modality, e.g., video features pre-trained on action recognition datasets (Kinetics [23], Sport1M [22]), VTP has several advantages: 1) It leverages large-scale unlabeled narrated video data, with automatically generated corresponding text, for video-text correspondence pre-training.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)