Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 2021
DOI: 10.18653/v1/2021.findings-acl.20
GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

Abstract: Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason about the complex dependencies among visual content, dialog history, and the current question. Graph neural networks have recently been applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among the dialog history and dependency relations between words for the question representation; and 2) the …
Cited by 23 publications (12 citation statements)
References 40 publications
“…(2) The pretraining model: VD-BERT [1] and VisDial-BERT [22]. (4) Graph-based models: GNN-EM [17], DualVD [19], FGA [18], GoG [6], KBGN [21].…”
Section: Baseline Methods (mentioning)
confidence: 99%
“…Recently, with the rise of pre-trained models [2], researchers have begun to explore vision-and-language tasks [3,4,5] with pre-trained models [1]. Specifically, visual dialog [6,7,8,9], which aims to hold a meaningful conversation with a human about a given image, is a challenging task that requires models to have sufficient cross-modal understanding based on both visual and textual context to answer the current question.…”
Section: Introduction (mentioning)
confidence: 99%
“…Therefore, how to effectively realize multi-modal representation learning and cross-modal semantic relation reasoning over the rich underlying semantic structures of visual information and dialogue context is one of the key challenges. Researchers have proposed to model images or videos and dialogue as graph structures [10,34,203] and to conduct cross attention-based reasoning [17,118,139] to perform fine-grained cross-modal relation reasoning for reasonable response generation; see details in section 3.3.…”
Section: Research Challenges in VAD (mentioning)
confidence: 99%
“…Although the above works have employed graph-based structures, their models still fail to explicitly capture the complex relations within visual information or textual contexts. Chen et al [10] propose the graph-over-graph network (GoG), which consists of three cross-modal graphs to capture the relations and dependencies between query words, dialogue history, and visual objects in image-based dialogue. The resulting high-level representation of cross-modal information is then used to generate visually and contextually coherent responses.…”
Section: Graph-based Semantic Relation (mentioning)
confidence: 99%
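The passage above describes GoG's core idea at a high level: intra-graph reasoning over question words, dialogue history, and visual objects, followed by cross-graph fusion into a joint representation. The minimal numpy sketch below illustrates that pattern under stated assumptions; the shapes, the fusion order, and all function names are illustrative inventions, not the paper's actual architecture or code.

```python
# Illustrative sketch only: one message-passing step over three node
# sets (question words, history turns, visual objects), loosely
# mirroring the graph-over-graph idea. Everything here is an
# assumption for exposition, not the GoG paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(nodes, adj):
    """Attention-weighted message passing within one graph.

    nodes: (n, d) node features; adj: (n, n) 0/1 adjacency.
    Scores are masked so messages flow only along graph edges.
    """
    scores = nodes @ nodes.T                  # pairwise similarity
    scores = np.where(adj > 0, scores, -1e9)  # keep edges only
    return softmax(scores, axis=-1) @ nodes   # aggregate neighbours

def cross_attention(query_nodes, context_nodes):
    """Let one graph's nodes attend over another graph's nodes."""
    weights = softmax(query_nodes @ context_nodes.T, axis=-1)
    return weights @ context_nodes

d = 8
words = rng.normal(size=(5, d))    # question-word nodes
history = rng.normal(size=(3, d))  # dialogue-history turn nodes
objects = rng.normal(size=(6, d))  # visual-object nodes

# Chain adjacency as a stand-in for a dependency-parse graph.
word_adj = np.eye(5) + np.eye(5, k=1) + np.eye(5, k=-1)

# Intra-graph reasoning on the question, then cross-graph fusion:
# the question attends to history (coreference), then to objects
# (visual grounding), yielding one fused vector per question word.
q = graph_attention(words, word_adj)
q = q + cross_attention(q, history)
q = q + cross_attention(q, objects)
print(q.shape)  # → (5, 8)
```

A real model would use learned projection matrices and multi-head attention rather than raw dot products, but the masking-then-aggregating pattern shown here is the essence of relation-aware graph reasoning.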
“…Visual Dialog (VD), which expects AI agents to conduct visually grounded dialog, has attracted growing interest due to its research significance and application prospects. Most of the work (Niu et al., 2019; Gan et al., 2019; Chen et al., 2020; Agarwal et al., 2020; Nguyen et al., 2020; Chen et al., 2021) pays attention to modeling an Answerer agent. However, it is also important to model a VD Questioner agent that can constantly ask visually related and informative questions.…”
Section: Introduction (mentioning)
confidence: 99%