2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01569

Bridging Video-text Retrieval with Multiple Choice Questions

Cited by 83 publications (28 citation statements)
References 24 publications
“…Existing works either learn unimodal encoders by distinguishing the positive pair(s) from the unpaired samples [3, 28, 41] or focus on one multimodal encoder for joint feature learning with masked image/language modeling and image-text matching losses [12, 27, 29]. Additionally, some approaches seek fine-grained supervision for cross-modal interaction [20, 30, 54-56]. For example, GLIP [30] proposed to align the bounding boxes with corresponding phrases in the text.…”
Section: Related Work
confidence: 99%
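
The first strategy in the statement above, training unimodal encoders to separate each positive video-text pair from the unpaired samples in a batch, is typically implemented as a symmetric InfoNCE loss. Below is a minimal PyTorch sketch under that assumption; the function name and temperature value are illustrative, not taken from the cited works.

```python
# A minimal sketch of a symmetric contrastive (InfoNCE) objective:
# each video embedding must match its paired text embedding against
# all other (unpaired) texts in the batch, and vice versa.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products become cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares video i with text j.
    logits = video_emb @ text_emb.t() / temperature
    # Positives lie on the diagonal (i-th video pairs with i-th text).
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: video-to-text and text-to-video cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: embeddings from any pair of unimodal encoders.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```
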
“…Entities vs nouns: masking and predicting noun phrases in the sentence [20] is one feasible option to learn fine-grained vision-text matching. However, noun masking leads to 3.5% lower mIoU than our entity masking strategy.…”
Section: Ablation Study
confidence: 99%
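
The noun-phrase masking baseline this quote compares against can be illustrated with off-the-shelf noun-chunk parsing. The following is a hypothetical sketch assuming spaCy's noun_chunks and a generic [MASK] token; the cited paper's entity masking strategy would further restrict the masked targets to visually grounded entities, which is not modeled here.

```python
# A sketch of noun-phrase masking: replace each noun phrase in a
# caption with a mask token and keep the erased text as the target
# the model must predict for fine-grained vision-text matching.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_noun_phrases(caption, mask_token="[MASK]"):
    doc = nlp(caption)
    masked, targets, last = [], [], 0
    for chunk in doc.noun_chunks:
        masked.append(caption[last:chunk.start_char])
        masked.append(mask_token)
        targets.append(chunk.text)
        last = chunk.end_char
    masked.append(caption[last:])
    return "".join(masked), targets

# e.g. "A dog chases a ball in the park"
#   -> ("[MASK] chases [MASK] in [MASK]", ["A dog", "a ball", "the park"])
print(mask_noun_phrases("A dog chases a ball in the park"))
```
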
“…Popular image-language models such as CLIP [83] and ALIGN [48] are trained on massive datasets by using web images and alt-text. Similarly, video-language models are catching up and can be categorised into two broad directions: (i) adapting image-language models for videos [8, 22, 49, 50, 62, 65, 71, 108, 110, 119], and (ii) pure video-based models that are learned using large video-text datasets [3, 7, 26-28, 30, 57, 61, 64, 67, 68, 95, 117]. Recently, a new paradigm of post-pretraining has emerged where an existing image- or video-language model goes through another stage of self-supervised pretraining on a small amount of video data before it is evaluated on downstream tasks [65, 119].…”
Section: Foundational Video-language Models
confidence: 99%
“…During training, the PEM task asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly-selected video crops. In BridgeFormer (Ge et al., 2022), the authors exploit the rich semantics of text (i.e., nouns and verbs) to build question-answer pairs, forming a question-answering pretext task with which the model can be trained to capture more regional content and temporal dynamics. Wang et al. (2022c) propose an object-aware Transformer that leverages bounding boxes and object tags to guide the training process.…”
Section: Advanced Pre-training Tasks
confidence: 99%
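
The question-answer construction this quote attributes to BridgeFormer can be sketched on the text side: erase a noun or a verb from the caption to form a "question", with the erased word as the "answer" the model must recover. This is only an illustrative approximation using spaCy part-of-speech tags; the actual method answers these questions from video features through a dedicated bridge module, which is omitted here.

```python
# A rough sketch of text-side question-answer pair construction:
# masking a noun probes regional content, masking a verb probes
# temporal dynamics, per the description in the quote above.
import spacy

nlp = spacy.load("en_core_web_sm")

def build_question_answer_pairs(caption, mask_token="[?]"):
    doc = nlp(caption)
    pairs = []
    for token in doc:
        if token.pos_ in ("NOUN", "VERB"):
            # Erase the token at its character offset to form the question.
            question = (caption[:token.idx] + mask_token +
                        caption[token.idx + len(token.text):])
            pairs.append((question, token.text))
    return pairs

for q, a in build_question_answer_pairs("A chef slices onions on a board"):
    print(f"Q: {q}  A: {a}")
```
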