Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1648
Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

Abstract: This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), which uses multi-step reasoning to answer a series of questions about an image. In each question-answering turn of a dialog, ReDAN infers the answer progressively through multiple reasoning steps. In each step of the reasoning process, the semantic representation of the question is updated based on the image and the previous dialog history, and the recurrently refined representation is used for further reasoning in the subs…
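The abstract describes a reasoning loop in which a question representation is repeatedly refined by attending to image features and dialog history. The following is a minimal, parameter-free sketch of that idea using plain dot-product attention; the function names, dimensions, and the additive update rule are illustrative assumptions, not the actual ReDAN implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Dot-product attention: pool feature rows into one context vector,
    weighted by each row's similarity to the query."""
    weights = softmax(features @ query)  # (n,)
    return weights @ features            # (d,)

def multi_step_reasoning(question, image_feats, history_feats, steps=3):
    """Refine the question vector over several reasoning steps, each time
    attending to image regions and past dialog turns (simplified update)."""
    q = question
    for _ in range(steps):
        image_ctx = attend(q, image_feats)      # visual attention
        history_ctx = attend(q, history_feats)  # textual attention
        q = q + image_ctx + history_ctx         # recurrent refinement
    return q

rng = np.random.default_rng(0)
d = 8
q0 = rng.standard_normal(d)
img = rng.standard_normal((5, d))    # 5 image regions
hist = rng.standard_normal((3, d))   # 3 previous dialog turns
refined = multi_step_reasoning(q0, img, hist)
print(refined.shape)  # (8,)
```

In the paper's actual model the update is learned (with projection matrices and gating) rather than a plain sum; this sketch only shows the control flow of iterative refinement.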

Cited by 92 publications (64 citation statements). References 57 publications.
“…Wu et al. [10], Guo et al. [4], and Yang et al. [11] proposed models that apply a co-attention mechanism among three elements, the current question, the image, and the past dialog history, to determine the answer to the current question. Gan et al. [3] proposed a model that repeats co-attention among the three elements several times. Idan et al. [22] developed a factor-graph-based attention framework, where nodes correspond to utilities and factors model their interactions.…”
Section: Related Work
Confidence: 99%
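The excerpt above refers to co-attention between modalities. A common parameter-free formulation builds an affinity matrix between question words and image regions and normalizes it in both directions; the sketch below illustrates that pattern (shapes, pooling, and names are assumptions for illustration, not any cited paper's exact method).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(question_words, image_regions):
    """Co-attention sketch: one affinity matrix between words and regions
    yields attention in both directions (words -> regions, regions -> words)."""
    # affinity[i, j]: similarity between word i and region j
    affinity = question_words @ image_regions.T        # (n_words, n_regions)
    region_attn = softmax(affinity, axis=1)            # normalize over regions
    word_attn = softmax(affinity, axis=0)              # normalize over words
    attended_regions = region_attn @ image_regions     # (n_words, d)
    attended_words = word_attn.T @ question_words      # (n_regions, d)
    # summarize each side by mean pooling (a simplification)
    return attended_regions.mean(axis=0), attended_words.mean(axis=0)

rng = np.random.default_rng(1)
words = rng.standard_normal((4, 6))    # 4 question-word vectors
regions = rng.standard_normal((7, 6))  # 7 image-region vectors
v_ctx, q_ctx = co_attention(words, regions)
print(v_ctx.shape, q_ctx.shape)  # (6,) (6,)
```

Models that add dialog history as a third element typically compute such pairwise attention for each pair of modalities and fuse the resulting context vectors.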
“…The existing models for visual dialog have mostly been implemented as a single large monolithic neural network [3, 4, 5, 6, 7, 8, 9, 10, 11]. However, VQA and visual dialog are composable in nature, in that the process of generating an answer to a natural language question can be completed by composing multiple basic neural network modules.…”
Section: Introduction
Confidence: 99%
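The excerpt above contrasts monolithic networks with composing basic neural modules. The toy sketch below shows what such composition looks like in principle: small functions standing in for modules (attend, fuse, answer) chained per question. The module names and the pipeline are hypothetical illustrations, not the API of any actual neural-module system.

```python
import numpy as np

# Toy neural-module sketch: each "module" is a small function over image
# region features; a question is answered by composing modules rather than
# by one monolithic network (illustrative names and logic only).

def find(regions, query):
    """Attention module: soft-select the regions matching a query vector."""
    scores = regions @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ regions  # pooled region feature

def combine(a, b):
    """Fusion module: merge two pooled features."""
    return np.tanh(a + b)

def answer(feature, answer_embeddings):
    """Answer module: pick the candidate answer closest to the feature."""
    return int(np.argmax(answer_embeddings @ feature))

rng = np.random.default_rng(2)
regions = rng.standard_normal((6, 5))   # 6 image-region vectors
q_color = rng.standard_normal(5)        # stand-in for "what color"
q_object = rng.standard_normal(5)       # stand-in for "the ball"
answers = rng.standard_normal((10, 5))  # 10 candidate-answer embeddings

# Compose: answer(combine(find(color), find(object)))
pred = answer(combine(find(regions, q_color), find(regions, q_object)), answers)
print(0 <= pred < 10)  # True
```

In a real modular system the layout of modules is predicted from the question and the modules share learned parameters; here the composition is fixed by hand to keep the sketch self-contained.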
“…Multi-modal Dialogue Systems. Recently, research on dialog systems has shifted towards integrating various modalities, such as images, audio, and video, along with text, to obtain the information needed to build a robust framework. The research reported in (Das et al., 2017; Mostafazadeh et al., 2017; De Vries et al., 2017; Gan et al., 2019) has been effective in narrowing the gap between vision and language. Similarly, in (Le et al., 2019; Alamri et al., 2018; Lin et al., 2019a), the DSTC7 dataset has been used for response generation by incorporating audio and visual features.…”
Section: Related Work
Confidence: 99%
“…Recently, research in dialogue systems has shifted towards incorporating different modalities, such as images, audio, and video, to capture the information needed to build robust systems. The research reported in [7, 35-38] has been effective in narrowing the gap between vision and language. In [36], an Image Grounded Conversations (IGC) task was proposed, in which natural-sounding conversations were generated about a shared image.…”
Section: Multimodal Dialogue Systems
Confidence: 99%