2022
DOI: 10.1609/aaai.v36i1.19891
Visual Consensus Modeling for Video-Text Retrieval

Abstract: In this paper, we propose a novel method to mine the commonsense knowledge shared between the video and text modalities for video-text retrieval, namely visual consensus modeling. Different from the existing works, which learn the video and text representations and their complicated relationships solely based on the pairwise video-text data, we make the first attempt to model the visual consensus by mining the visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities…
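The abstract only sketches the mechanism, but its core ingredient, i.e. co-occurrence statistics of visual concepts mined across videos, can be illustrated with a small stand-alone example. The snippet below is a hedged sketch rather than the authors' implementation: the concept vocabulary, the toy per-video detections, and the row-normalization step are assumptions made purely for illustration.

```python
import itertools

import numpy as np


def build_cooccurrence(concepts_per_video, vocab):
    """Count how often two visual concepts appear together in the same video."""
    index = {c: i for i, c in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for concepts in concepts_per_video:
        present = sorted({index[c] for c in concepts if c in index})
        for i, j in itertools.combinations(present, 2):
            counts[i, j] += 1.0
            counts[j, i] += 1.0
    # Row-normalize so each row becomes a distribution over co-occurring concepts.
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1.0)


# Toy corpus: concepts detected (e.g. by an off-the-shelf detector) in three videos.
videos = [
    ["person", "guitar", "stage"],
    ["person", "dog", "park"],
    ["person", "guitar", "microphone"],
]
vocab = ["person", "guitar", "stage", "dog", "park", "microphone"]
consensus = build_cooccurrence(videos, vocab)
print(consensus[vocab.index("guitar")])  # concepts that tend to co-occur with "guitar"
```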

Cited by 16 publications (12 citation statements); references 38 publications.
“…Commonsense has also been incorporated into tasks such as video captioning (Yu et al 2021), video question answering (Li, Niu, and Zhang 2022), and visual story generation (Maharana and Bansal 2021). Existing methods enhance query-based video retrieval using a co-occurrence graph of concepts mined from the target video moment (Wu et al 2022; Cao et al 2022). However, both are proposal-based fully supervised approaches that rely on fine-grained annotations and the quality of candidate video moments, let alone solely exploit the internal relations between the detected visual objects through a co-occurrence graph of entities as opposed to using external knowledge sources.…”
Section: Commonsense In Video-language Tasks
Mentioning confidence: 99%
“…Natural Language Video Localization (NLVL) is a fundamental multimodal understanding task that aims to align textual queries with relevant video segments. NLVL is a core component for various applications such as video moment retrieval (Cao et al 2022), video question answering (Qian et al 2023; Lei et al 2020a), and video editing (Gao et al 2022). Prior works have primarily explored supervised (Zeng et al 2020; Wang, Ma, and Jiang 2020; Soldan et al 2021; Liu et al 2021; Yu et al 2020) or weakly supervised (Mun, Cho, and Han 2020; Zhang et al 2020, 2021) NLVL methodologies, relying on annotated video-query data to various extents.…”
Section: Introduction
Mentioning confidence: 99%
“…Most of the existing video-text retrieval models use multilayer transformers to learn generic representations from massive video-text pairs, which can be roughly divided into two categories. The first category uses only frame features to transfer the knowledge of image-text pretrained model to video-text retrieval task without fully exploring the multimodal information of videos (Lei et al 2021; Luo et al 2021; Fang et al 2021; Cheng et al 2021; Cao et al 2022). A representative method is CLIP4Clip (Luo et al 2021), which utilizes the knowledge of the CLIP (Contrastive Language-Image Pretraining) (Radford et al 2021) model to visually encode multi-frame information as an overall representation of the video.…”
Section: Introduction
Mentioning confidence: 99%
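As a rough illustration of the CLIP4Clip-style baseline mentioned in the statement above, the sketch below mean-pools per-frame CLIP image embeddings into a single video vector and scores it against a text query by cosine similarity. It uses the openai/CLIP package; the frame-sampling strategy, the temporal modules, and the actual CLIP4Clip training objective are all omitted, and the frame paths and query are placeholders, so treat this as an assumption-laden sketch rather than the method itself.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assume a handful of frames were already sampled from the video (paths are placeholders).
frame_paths = ["frame_000.jpg", "frame_008.jpg", "frame_016.jpg"]
frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)

with torch.no_grad():
    frame_emb = model.encode_image(frames)            # (num_frames, dim)
    video_emb = frame_emb.mean(dim=0, keepdim=True)   # mean-pool frames into one video vector
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

    text = clip.tokenize(["a man playing guitar on stage"]).to(device)
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = (video_emb @ text_emb.T).item()          # cosine similarity, used as the retrieval score
print(f"video-text similarity: {similarity:.3f}")
```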
“…Vision-language retrieval, such as image-text retrieval [10,48,47] and video-text retrieval [34,16,17,3,37], etc., is formulated to retrieve relevant samples across different vision and language modalities. Compared to unimodal image retrieval, vision-language retrieval is more challenging due to the heterogeneous gap between query and candidates.…”
Section: Introduction
Mentioning confidence: 99%