2022
DOI: 10.1609/aaai.v36i1.19891
Visual Consensus Modeling for Video-Text Retrieval

Abstract: In this paper, we propose a novel method to mine the commonsense knowledge shared between the video and text modalities for video-text retrieval, namely visual consensus modeling. Different from the existing works, which learn the video and text representations and their complicated relationships solely based on the pairwise video-text data, we make the first attempt to model the visual consensus by mining the visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities…
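The abstract only sketches the mechanism, but its core ingredient, i.e. co-occurrence statistics of visual concepts mined across videos, can be illustrated with a small stand-alone example. The snippet below is a hedged sketch rather than the authors' implementation: the concept vocabulary, the toy per-video detections, and the row-normalization step are assumptions made purely for illustration.

```python
import itertools

import numpy as np


def build_cooccurrence(concepts_per_video, vocab):
    """Count how often two visual concepts appear together in the same video."""
    index = {c: i for i, c in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for concepts in concepts_per_video:
        present = sorted({index[c] for c in concepts if c in index})
        for i, j in itertools.combinations(present, 2):
            counts[i, j] += 1.0
            counts[j, i] += 1.0
    # Row-normalize so each row becomes a distribution over co-occurring concepts.
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1.0)


# Toy corpus: concepts detected (e.g. by an off-the-shelf detector) in three videos.
videos = [
    ["person", "guitar", "stage"],
    ["person", "dog", "park"],
    ["person", "guitar", "microphone"],
]
vocab = ["person", "guitar", "stage", "dog", "park", "microphone"]
consensus = build_cooccurrence(videos, vocab)
print(consensus[vocab.index("guitar")])  # concepts that tend to co-occur with "guitar"
```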

Cited by 16 publications (12 citation statements); references 38 publications.
“…Commonsense has also been incorporated into tasks such as video captioning (Yu et al 2021), video question answering (Li, Niu, and Zhang 2022), and visual story generation (Maharana and Bansal 2021). Existing methods enhance query-based video retrieval using a co-occurrence graph of concepts mined from the target video moment (Wu et al 2022; Cao et al 2022). However, both are proposal-based fully supervised approaches that rely on fine-grained annotations and the quality of candidate video moments, let alone solely exploit the internal relations between the detected visual objects through a co-occurrence graph of entities as opposed to using external knowledge sources.…”
Section: Commonsense In Video-language Tasks
Mentioning confidence: 99%
“…Natural Language Video Localization (NLVL) is a fundamental multimodal understanding task that aims to align textual queries with relevant video segments. NLVL is a core component for various applications such as video moment retrieval (Cao et al 2022), video question answering (Qian et al 2023; Lei et al 2020a), and video editing (Gao et al 2022). Prior works have primarily explored supervised (Zeng et al 2020; Wang, Ma, and Jiang 2020; Soldan et al 2021; Liu et al 2021; Yu et al 2020) or weakly supervised (Mun, Cho, and Han 2020; Zhang et al 2020, 2021) NLVL methodologies, relying on annotated video-query data to various extents.…”
Section: Introduction
Mentioning confidence: 99%
“…Most of the existing video-text retrieval models use multilayer transformers to learn generic representations from massive video-text pairs, which can be roughly divided into two categories. The first category uses only frame features to transfer the knowledge of image-text pretrained model to video-text retrieval task without fully exploring the multimodal information of videos (Lei et al 2021; Luo et al 2021; Fang et al 2021; Cheng et al 2021; Cao et al 2022). A representative method is CLIP4Clip (Luo et al 2021), which utilizes the knowledge of the CLIP (Contrastive Language-Image Pretraining) (Radford et al 2021) model to visually encode multi-frame information as an overall representation of the video.…”
Section: Introduction
Mentioning confidence: 99%
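As a rough illustration of the CLIP4Clip-style baseline mentioned in the statement above, the sketch below mean-pools per-frame CLIP image embeddings into a single video vector and scores it against a text query by cosine similarity. It uses the openai/CLIP package; the frame-sampling strategy, the temporal modules, and the actual CLIP4Clip training objective are all omitted, and the frame paths and query are placeholders, so treat this as an assumption-laden sketch rather than the method itself.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assume a handful of frames were already sampled from the video (paths are placeholders).
frame_paths = ["frame_000.jpg", "frame_008.jpg", "frame_016.jpg"]
frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)

with torch.no_grad():
    frame_emb = model.encode_image(frames)            # (num_frames, dim)
    video_emb = frame_emb.mean(dim=0, keepdim=True)   # mean-pool frames into one video vector
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

    text = clip.tokenize(["a man playing guitar on stage"]).to(device)
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = (video_emb @ text_emb.T).item()          # cosine similarity, used as the retrieval score
print(f"video-text similarity: {similarity:.3f}")
```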
“…Vision-language retrieval, such as image-text retrieval [10,48,47] and video-text retrieval [34,16,17,3,37], etc., is formulated to retrieve relevant samples across different vision and language modalities. Compared to unimodal image retrieval, vision-language retrieval is more challenging due to the heterogeneous gap between query and candidates.…”
Section: Introduction
Mentioning confidence: 99%