Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 2021
DOI: 10.18653/v1/2021.findings-acl.20
GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

Abstract: Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason about the complex dependencies among visual content, dialog history, and the current question. Graph neural networks have recently been applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among the dialog history and dependency relations between words for the question representation; and 2) the …
Cited by 23 publications (12 citation statements)
References 40 publications
“…(2) The pretraining model: VD-BERT [1] and VisDial-BERT [22]. (4) Graph-based models: GNN-EM [17], DualVD [19], FGA [18], GoG [6], KBGN [21].…”
Section: Baseline Methods (mentioning)
confidence: 99%
“…Recently, with the rise of pre-trained models [2], researchers have begun to explore vision-and-language tasks [3,4,5] with pre-trained models [1]. Specifically, visual dialog [6,7,8,9], which aims to hold a meaningful conversation with a human about a given image, is a challenging task that requires models to have sufficient cross-modal understanding based on both visual and textual context to answer the current question.…”
Section: Introduction (mentioning)
confidence: 99%
“…Therefore, how to effectively realize multi-modal representation learning and cross-modal semantic relation reasoning over the rich underlying semantic structures of visual information and dialogue context is one of the key challenges. Researchers have proposed to model images or videos and dialogue as graph structures [10,34,203] and to conduct cross attention-based reasoning [17,118,139] to perform fine-grained cross-modal relation reasoning for reasonable response generation; see details in section 3.3.…”
Section: Research Challenges in VAD (mentioning)
confidence: 99%
“…Although the above works have employed graph-based structures, their models still fail to explicitly capture the complex relations within visual information or textual contexts. Chen et al [10] propose the graph-over-graph network (GoG), which consists of three cross-modal graphs to capture the relations and dependencies between query words, dialogue history, and visual objects in image-based dialogue. The resulting high-level representation of cross-modal information is then used to generate visually and contextually coherent responses.…”
Section: Graph-based Semantic Relation (mentioning)
confidence: 99%
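The passage above describes GoG's core idea at a high level: intra-graph reasoning over question words, dialogue history, and visual objects, followed by cross-graph fusion into a joint representation. The minimal numpy sketch below illustrates that pattern under stated assumptions; the shapes, the fusion order, and all function names are illustrative inventions, not the paper's actual architecture or code.

```python
# Illustrative sketch only: one message-passing step over three node
# sets (question words, history turns, visual objects), loosely
# mirroring the graph-over-graph idea. Everything here is an
# assumption for exposition, not the GoG paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(nodes, adj):
    """Attention-weighted message passing within one graph.

    nodes: (n, d) node features; adj: (n, n) 0/1 adjacency.
    Scores are masked so messages flow only along graph edges.
    """
    scores = nodes @ nodes.T                  # pairwise similarity
    scores = np.where(adj > 0, scores, -1e9)  # keep edges only
    return softmax(scores, axis=-1) @ nodes   # aggregate neighbours

def cross_attention(query_nodes, context_nodes):
    """Let one graph's nodes attend over another graph's nodes."""
    weights = softmax(query_nodes @ context_nodes.T, axis=-1)
    return weights @ context_nodes

d = 8
words = rng.normal(size=(5, d))    # question-word nodes
history = rng.normal(size=(3, d))  # dialogue-history turn nodes
objects = rng.normal(size=(6, d))  # visual-object nodes

# Chain adjacency as a stand-in for a dependency-parse graph.
word_adj = np.eye(5) + np.eye(5, k=1) + np.eye(5, k=-1)

# Intra-graph reasoning on the question, then cross-graph fusion:
# the question attends to history (coreference), then to objects
# (visual grounding), yielding one fused vector per question word.
q = graph_attention(words, word_adj)
q = q + cross_attention(q, history)
q = q + cross_attention(q, objects)
print(q.shape)  # → (5, 8)
```

A real model would use learned projection matrices and multi-head attention rather than raw dot products, but the masking-then-aggregating pattern shown here is the essence of relation-aware graph reasoning.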
“…Visual Dialog (VD), which expects AI agents to conduct visually grounded dialog, has attracted growing interest due to its research significance and application prospects. Most of the work (Niu et al., 2019; Gan et al., 2019; Chen et al., 2020; Agarwal et al., 2020; Nguyen et al., 2020; Chen et al., 2021) pays attention to modeling an Answerer agent. However, it is also important to model a VD Questioner agent that can constantly ask visually related and informative questions.…”
Section: Introduction (mentioning)
confidence: 99%