2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00331
Object-aware Video-language Pre-training for Retrieval

Cited by 47 publications (24 citation statements)
References 21 publications
“…Video-text Contrastive (VTC) [2,13,14,23,49,50]. As detailed in Section 3, VTC contrasts the outputs of two single-modal encoders, pulling their embedding spaces closer so that the subsequent cross-modal encoder can build more robust vision-language associations.…”
Section: Training Objectives
confidence: 99%
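The VTC objective described above is typically implemented as a symmetric InfoNCE loss over the two single-modal embeddings. A minimal PyTorch sketch, assuming pre-computed video and text embeddings (the function and argument names are illustrative, not from the cited papers):

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) outputs of the two single-modal encoders.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls matched video-text pairs together and pushes mismatched pairs apart, which is the "pulling embedding spaces closer" effect the statement refers to.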
“…Following recent work [2,13,14,23,49], we pre-train TW-BERT on Google Conceptual Captions (CC3M) [45], containing 3.3M image-text pairs, and WebVid-2M [2], containing 2.5M video-text pairs. For CC3M, each image is treated as a one-frame video during pre-training.…”
Section: Pre-training Dataset
confidence: 99%
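Treating a CC3M image as a one-frame video usually amounts to adding a singleton temporal dimension so that image-text and video-text pairs flow through the same encoder. A hypothetical sketch, assuming a (C, H, W) image tensor and a (T, C, H, W) clip layout:

```python
import torch

def image_as_one_frame_video(image: torch.Tensor) -> torch.Tensor:
    """Lift a (C, H, W) image to a (T=1, C, H, W) clip so a video
    encoder can consume image-text pairs (e.g. CC3M) unchanged."""
    return image.unsqueeze(0)

image = torch.randn(3, 224, 224)  # dummy RGB image
clip = image_as_one_frame_video(image)
assert clip.shape == (1, 3, 224, 224)
```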
“…In BridgeFormer (Ge et al., 2022), the authors exploit the rich semantics of text (i.e., nouns and verbs) to build question-answer pairs, forming a question-answering pretext task with which the model can be trained to capture more regional content and temporal dynamics. Wang et al. (2022c) propose an object-aware Transformer that leverages bounding boxes and object tags to guide the training process.…”
Section: Advanced Pre-training Tasks
confidence: 99%
“…Vision-language retrieval, such as image-text retrieval [10,48,47] and video-text retrieval [34,16,17,3,37], is formulated to retrieve relevant samples across vision and language modalities. Compared to unimodal image retrieval, vision-language retrieval is more challenging due to the heterogeneous gap between the query and the candidates.…”
Section: Introduction
confidence: 99%
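Once the two encoders are trained, retrieval itself reduces to ranking candidates by embedding similarity. A minimal sketch under that assumption (all names illustrative):

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor):
    """Return candidate indices sorted by cosine similarity to the query.

    query_emb: (D,) embedding of e.g. a text query.
    candidate_embs: (N, D) embeddings of e.g. N candidate videos.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs, dim=-1)
    return torch.argsort(sims, descending=True)

query = torch.randn(256)
candidates = torch.randn(1000, 256)
top10 = rank_candidates(query, candidates)[:10]  # indices of best matches
```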