2020
DOI: 10.1609/aaai.v34i07.6833

Learning Cross-Modal Context Graph for Visual Grounding

Abstract: Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic ambiguities. Prior works typically focus on learning representations of individual phrases with limited context information. To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their rel…

Cited by 73 publications (47 citation statements)
References 26 publications
“…Existing works can be roughly divided into two categories: supervised and weakly supervised. The supervised methods [12,29,35,51] took bounding boxes as supervision to enforce the alignment between image regions and noun phrases, and have achieved remarkable success. However, annotating the region word alignments is expensive and time-consuming, which makes it difficult to scale to large datasets.…”
Section: Related Work (mentioning)
Confidence: 99%
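In the supervised setting described above, bounding-box annotations are typically turned into a classification or ranking objective over candidate regions for each noun phrase. The following is a minimal sketch of such a region-phrase alignment loss, assuming precomputed region features and a phrase embedding; the function name, feature dimensions, and temperature are illustrative and not taken from any of the cited methods.

```python
import torch
import torch.nn.functional as F

def alignment_loss(region_feats, phrase_emb, gt_region_idx, temperature=0.1):
    """Cross-entropy alignment loss over candidate regions for one phrase.

    region_feats:  (R, D) features of R candidate boxes (e.g. from a detector)
    phrase_emb:    (D,)   embedding of the noun phrase
    gt_region_idx: index of the annotated ground-truth box (the supervision)
    """
    # Cosine similarity between the phrase and every candidate region.
    sims = F.cosine_similarity(region_feats, phrase_emb.unsqueeze(0), dim=-1)
    # Treat grounding as classification over the candidate regions.
    logits = (sims / temperature).unsqueeze(0)          # (1, R)
    target = torch.tensor([gt_region_idx])              # (1,)
    return F.cross_entropy(logits, target)

# Toy usage with random tensors standing in for detector / text-encoder outputs.
regions = torch.randn(20, 256)   # 20 candidate boxes, 256-d features
phrase = torch.randn(256)        # phrase embedding
loss = alignment_loss(regions, phrase, gt_region_idx=3)
```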
“…Graphs are non-Euclidean structured data, which can effectively represent relationships between nodes. Some recent works construct graphs for visual or linguistic elements in V+L tasks, such as VQA [16,27,43], VideoQA [28,30,78], Image Captioning [23,69,75], and Visual Grounding [31,47,68], to reveal relationships between these elements and obtain fine-grained semantic representations. These constructed graphs can be broadly grouped into three types: visual graphs between image objects/regions (e.g., [69]), linguistic graphs between sentence elements/tokens (e.g., [33]), and cross-modal graphs among visual and linguistic elements (e.g., [47]).…”
Section: Graph Construction in V+L Tasks (mentioning)
Confidence: 99%
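As a simplified illustration of the third type, the sketch below assembles a cross-modal graph over candidate regions and phrase tokens, with dense cross-modal edges and similarity-thresholded intra-modal edges. The node features, threshold, and adjacency scheme are assumptions made for the example, not a reconstruction of any cited paper's graph construction.

```python
import torch
import torch.nn.functional as F

def build_cross_modal_graph(region_feats, token_feats, intra_thresh=0.5):
    """Build an adjacency matrix over [regions; tokens] nodes.

    region_feats: (R, D) visual node features
    token_feats:  (T, D) linguistic node features
    Returns an (R+T, R+T) 0/1 adjacency matrix.
    """
    feats = torch.cat([region_feats, token_feats], dim=0)   # (R+T, D)
    feats = F.normalize(feats, dim=-1)
    sims = feats @ feats.t()                                 # pairwise cosine similarities

    R, T = region_feats.size(0), token_feats.size(0)
    adj = torch.zeros(R + T, R + T)
    # Intra-modal edges: connect nodes of the same modality if similar enough.
    adj[:R, :R] = (sims[:R, :R] > intra_thresh).float()
    adj[R:, R:] = (sims[R:, R:] > intra_thresh).float()
    # Cross-modal edges: connect every token to every region (dense bipartite part).
    adj[:R, R:] = 1.0
    adj[R:, :R] = 1.0
    return adj

# Toy usage with random features standing in for detector / text-encoder outputs.
adj = build_cross_modal_graph(torch.randn(5, 128), torch.randn(7, 128))
```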
“…Most of these tasks would benefit from better phrase-to-object localization, a task which attempts to learn a mapping between phrases in the caption and objects in the image by measuring their similarity. Existing works consider the phrase-to-object localization problem under various training scenarios, including supervised learning (Rohrbach et al., 2016; Yu et al., 2018; Liu et al., 2020; Plummer et al., 2015; Li et al., 2019) and weakly-supervised learning (Rohrbach et al., 2016; Yeh et al., 2018; Chen et al., 2018). Besides the standard phrase-object matching setup, previous works (Xiao et al., 2017; Akbari et al., 2019; Datta et al., 2019) have also explored a pixel-level "pointing-game" setting, which is easier to model and evaluate but less realistic.…”
Section: Related Work (mentioning)
Confidence: 99%
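For reference, the pointing-game setting mentioned above scores a prediction as correct when a single predicted point falls inside the ground-truth box of the referred object. A minimal sketch of that metric follows; the point and box formats are assumed for illustration.

```python
def pointing_game_hit(point, gt_box):
    """Return True if the predicted point lands inside the ground-truth box.

    point:  (x, y) predicted pixel location for the phrase
    gt_box: (x_min, y_min, x_max, y_max) annotated box of the referred object
    """
    x, y = point
    x_min, y_min, x_max, y_max = gt_box
    return x_min <= x <= x_max and y_min <= y <= y_max

# Toy usage: accuracy over a few (prediction, box) pairs.
preds = [(50, 60), (200, 10)]
boxes = [(40, 40, 120, 120), (0, 0, 100, 100)]
accuracy = sum(pointing_game_hit(p, b) for p, b in zip(preds, boxes)) / len(preds)
```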
“…the caption and particular objects in the image. Existing work (Rohrbach et al., 2016; Kim et al., 2018; Li et al., 2019; Yu et al., 2018; Liu et al., 2020) mainly focuses on the supervised phrase localization setting. This requires a large-scale annotated dataset of phrase-object pairs for model training.…”
Section: Introduction (mentioning)
Confidence: 99%