Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence 2020
DOI: 10.24963/ijcai.2020/96

DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue

Abstract: The Visual Dialogue task requires an agent to engage in a conversation with a human about an image. The ability to generate detailed and non-repetitive responses is crucial for the agent to achieve human-like conversation. In this paper, we propose a novel generative decoding architecture to generate high-quality responses, which moves away from decoding the whole encoded semantics towards a design that advocates both transparency and flexibility. In this architecture, word generation is decomposed i…
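The abstract is truncated, but the title suggests that per-word generation is decomposed into deliberation, abandon, and memory steps rather than decoding the whole encoded semantics at once. The following is a minimal, hypothetical PyTorch sketch of what one such per-word decoding step could look like; the module names and wiring are assumptions inferred from the title and truncated abstract, not the published architecture.

import torch
import torch.nn as nn

class DAMDecoderStep(nn.Module):
    # Hypothetical per-word decoding step. Instead of consuming the whole
    # encoded semantics, each word selects (Deliberation), filters (Abandon),
    # and tracks (Memory) information. Illustrative sketch only.
    def __init__(self, hidden_dim):
        super().__init__()
        self.deliberate = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.abandon_gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.memory_cell = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, word_state, encoded_semantics, memory):
        # word_state: (batch, hidden); encoded_semantics: (batch, seq, hidden); memory: (batch, hidden)
        # Deliberation: attend over the encoder outputs for this word only.
        attended, _ = self.deliberate(word_state.unsqueeze(1), encoded_semantics, encoded_semantics)
        attended = attended.squeeze(1)
        # Abandon: gate out information already covered by the memory,
        # discouraging repetitive responses.
        gate = torch.sigmoid(self.abandon_gate(torch.cat([attended, memory], dim=-1)))
        filtered = gate * attended
        # Memory: update the running record of what has been generated so far.
        new_memory = self.memory_cell(filtered, memory)
        return filtered, new_memory

# Example usage with dummy tensors:
# step = DAMDecoderStep(hidden_dim=512)
# enc = torch.randn(2, 36, 512)   # encoded image/dialogue features (assumed shape)
# mem = torch.zeros(2, 512)
# out, mem = step(torch.randn(2, 512), enc, mem)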

Citations: Cited by 19 publications (9 citation statements)
References: 0 publications
“…ReDAN [9] adopts multi-step reasoning and outperforms our model on some metrics. DMRM [4] and DAM [16] achieve higher performance by designing a more complex generative decoder. HACAN [46] introduces multihead attention and two-stage training, achieving comparable results with us.…”
Section: Overall Results
confidence: 99%
“…RvA (Niu et al, 2019), DVAN (Guo et al, 2019b) and DMRM (Chen et al, 2020a), DAM (Jiang et al, 2020c).…”
Section: Model
confidence: 99%
“…The visual dialogue task was proposed by Das et al [6], and requires an agent to answer multi-round questions about a static image [7,18,20]. Previous work [12,19,21,24,41,57,61] focused on developing different attention mechanisms to model the interactions among image, question, and dialogue history [56].…”
Section: Visual Dialogue
confidence: 99%