2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00331
Object-aware Video-language Pre-training for Retrieval

Cited by 47 publications (24 citation statements)
References 21 publications
“…Video-text Contrastive (VTC) [2,13,14,23,49,50]. As detailed in Section 3, VTC contrasts the outputs of two single-modal encoders, pulling their embedding spaces closer so that the subsequent cross-modal encoder can build more robust vision-language associations.…”
Section: Training Objectives
confidence: 99%
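The VTC objective described above is typically implemented as a symmetric InfoNCE loss over the two single-modal embeddings. A minimal PyTorch sketch, assuming pre-computed video and text embeddings (the function and argument names are illustrative, not from the cited papers):

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) outputs of the two single-modal encoders.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls matched video-text pairs together and pushes mismatched pairs apart, which is the "pulling embedding spaces closer" effect the statement refers to.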
“…Following recent work [2,13,14,23,49], we pre-train TW-BERT on Google Conceptual Captions (CC3M) [45], containing 3.3M image-text pairs, and WebVid-2M [2], containing 2.5M video-text pairs. For CC3M, each image is treated as a one-frame video during pre-training.…”
Section: Pre-training Dataset
confidence: 99%
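Treating a CC3M image as a one-frame video usually amounts to adding a singleton temporal dimension so that image-text and video-text pairs flow through the same encoder. A hypothetical sketch, assuming a (C, H, W) image tensor and a (T, C, H, W) clip layout:

```python
import torch

def image_as_one_frame_video(image: torch.Tensor) -> torch.Tensor:
    """Lift a (C, H, W) image to a (T=1, C, H, W) clip so a video
    encoder can consume image-text pairs (e.g. CC3M) unchanged."""
    return image.unsqueeze(0)

image = torch.randn(3, 224, 224)  # dummy RGB image
clip = image_as_one_frame_video(image)
assert clip.shape == (1, 3, 224, 224)
```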
“…In BridgeFormer (Ge et al., 2022), the authors exploit the rich semantics of text (i.e., nouns and verbs) to build question-answer pairs, forming a question-answering pretext task with which the model can be trained to capture more regional content and temporal dynamics. Wang et al. (2022c) propose an object-aware Transformer that leverages bounding boxes and object tags to guide the training process.…”
Section: Advanced Pre-training Tasks
confidence: 99%
“…Vision-language retrieval, such as image-text retrieval [10,48,47] and video-text retrieval [34,16,17,3,37], is formulated to retrieve relevant samples across vision and language modalities. Compared to unimodal image retrieval, vision-language retrieval is more challenging due to the heterogeneous gap between the query and the candidates.…”
Section: Introduction
confidence: 99%
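Once the two encoders are trained, retrieval itself reduces to ranking candidates by embedding similarity. A minimal sketch under that assumption (all names illustrative):

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor):
    """Return candidate indices sorted by cosine similarity to the query.

    query_emb: (D,) embedding of e.g. a text query.
    candidate_embs: (N, D) embeddings of e.g. N candidate videos.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs, dim=-1)
    return torch.argsort(sims, descending=True)

query = torch.randn(256)
candidates = torch.randn(1000, 256)
top10 = rank_candidates(query, candidates)[:10]  # indices of best matches
```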