2020
DOI: 10.48550/arxiv.2007.03310
Preprint
DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue

Cited by 2 publications (2 citation statements)
References 9 publications
“…VisDial: Similar to the previous work (Kang et al 2023), we compare the performance of our method with 10 baselines: 1) Attention-based models: CoAtt (Wu et al 2018), HCIAE (Lu et al 2017), Primary (Guo, Xu, and Tao 2019), ReDAN (Gan et al 2019), DMRM (Chen et al 2020a), DAM (Jiang et al 2020b); 2) Graph-based models: KBGN (Jiang et al 2020a), LTMI (Nguyen, Suganuma, and Okatani 2020), LTMI-GoG (Chen et al 2021); 3) Semi-supervised learning model: GST (Kang et al 2023).…”
Section: Baselines
Citation type: mentioning (confidence: 99%)
“…Then attention-based models (Lu et al, 2017; Wu et al, 2018; Kottur et al, 2018) are proposed to dynamically attend to spatial image features in order to find related visual content. Furthermore, models based on object-level image features (Gan et al, 2019; Chen et al, 2020a; Jiang et al, 2020a; Nguyen et al, 2020; Jiang et al, 2020b) are proposed to effectively leverage the visual content for multimodal co-reference. However, as implicit exploration of multimodal co-reference, these methods implicitly attend to spatial or object-level image features, which are trained with the whole model and are inevitably distracted by unnecessary visual content.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)