Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1648
Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

Abstract: This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), which uses multi-step reasoning to answer a series of questions about an image. In each question-answering turn of a dialog, ReDAN infers the answer progressively through multiple reasoning steps. In each step of the reasoning process, the semantic representation of the question is updated based on the image and the previous dialog history, and the recurrently refined representation is used for further reasoning in the subs…
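The abstract describes a reasoning loop in which a question representation is repeatedly refined by attending to image features and dialog history. The following is a minimal, parameter-free sketch of that idea using plain dot-product attention; the function names, dimensions, and the additive update rule are illustrative assumptions, not the actual ReDAN implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Dot-product attention: pool feature rows into one context vector,
    weighted by each row's similarity to the query."""
    weights = softmax(features @ query)  # (n,)
    return weights @ features            # (d,)

def multi_step_reasoning(question, image_feats, history_feats, steps=3):
    """Refine the question vector over several reasoning steps, each time
    attending to image regions and past dialog turns (simplified update)."""
    q = question
    for _ in range(steps):
        image_ctx = attend(q, image_feats)      # visual attention
        history_ctx = attend(q, history_feats)  # textual attention
        q = q + image_ctx + history_ctx         # recurrent refinement
    return q

rng = np.random.default_rng(0)
d = 8
q0 = rng.standard_normal(d)
img = rng.standard_normal((5, d))    # 5 image regions
hist = rng.standard_normal((3, d))   # 3 previous dialog turns
refined = multi_step_reasoning(q0, img, hist)
print(refined.shape)  # (8,)
```

In the paper's actual model the update is learned (with projection matrices and gating) rather than a plain sum; this sketch only shows the control flow of iterative refinement.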

Cited by 92 publications (64 citation statements). References 57 publications.
“…Wu et al. [10], Guo et al. [4], and Yang et al. [11] proposed models that apply a co-attention mechanism among three elements, the current question, the image, and the past dialog history, to determine the answer to the current question. Gan et al. [3] proposed a model that repeats co-attention among the three elements several times. Idan et al. [22] developed a factor-graph-based attention framework, where nodes correspond to utilities and factors model their interactions.…”
Section: Related Work
Confidence: 99%
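The excerpt above refers to co-attention between modalities. A common parameter-free formulation builds an affinity matrix between question words and image regions and normalizes it in both directions; the sketch below illustrates that pattern (shapes, pooling, and names are assumptions for illustration, not any cited paper's exact method).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(question_words, image_regions):
    """Co-attention sketch: one affinity matrix between words and regions
    yields attention in both directions (words -> regions, regions -> words)."""
    # affinity[i, j]: similarity between word i and region j
    affinity = question_words @ image_regions.T        # (n_words, n_regions)
    region_attn = softmax(affinity, axis=1)            # normalize over regions
    word_attn = softmax(affinity, axis=0)              # normalize over words
    attended_regions = region_attn @ image_regions     # (n_words, d)
    attended_words = word_attn.T @ question_words      # (n_regions, d)
    # summarize each side by mean pooling (a simplification)
    return attended_regions.mean(axis=0), attended_words.mean(axis=0)

rng = np.random.default_rng(1)
words = rng.standard_normal((4, 6))    # 4 question-word vectors
regions = rng.standard_normal((7, 6))  # 7 image-region vectors
v_ctx, q_ctx = co_attention(words, regions)
print(v_ctx.shape, q_ctx.shape)  # (6,) (6,)
```

Models that add dialog history as a third element typically compute such pairwise attention for each pair of modalities and fuse the resulting context vectors.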
“…The existing models for visual dialog have mostly been implemented as a single large monolithic neural network [3, 4, 5, 6, 7, 8, 9, 10, 11]. However, VQA and visual dialog are composable in nature, in that the process of generating an answer to a natural language question can be completed by composing multiple basic neural network modules.…”
Section: Introduction
Confidence: 99%
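The excerpt above contrasts monolithic networks with composing basic neural modules. The toy sketch below shows what such composition looks like in principle: small functions standing in for modules (attend, fuse, answer) chained per question. The module names and the pipeline are hypothetical illustrations, not the API of any actual neural-module system.

```python
import numpy as np

# Toy neural-module sketch: each "module" is a small function over image
# region features; a question is answered by composing modules rather than
# by one monolithic network (illustrative names and logic only).

def find(regions, query):
    """Attention module: soft-select the regions matching a query vector."""
    scores = regions @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ regions  # pooled region feature

def combine(a, b):
    """Fusion module: merge two pooled features."""
    return np.tanh(a + b)

def answer(feature, answer_embeddings):
    """Answer module: pick the candidate answer closest to the feature."""
    return int(np.argmax(answer_embeddings @ feature))

rng = np.random.default_rng(2)
regions = rng.standard_normal((6, 5))   # 6 image-region vectors
q_color = rng.standard_normal(5)        # stand-in for "what color"
q_object = rng.standard_normal(5)       # stand-in for "the ball"
answers = rng.standard_normal((10, 5))  # 10 candidate-answer embeddings

# Compose: answer(combine(find(color), find(object)))
pred = answer(combine(find(regions, q_color), find(regions, q_object)), answers)
print(0 <= pred < 10)  # True
```

In a real modular system the layout of modules is predicted from the question and the modules share learned parameters; here the composition is fixed by hand to keep the sketch self-contained.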
“…Multi-modal Dialogue Systems. Recently, research on dialog systems has shifted towards integrating various modalities, such as images, audio, and video, along with text, to obtain the information needed to build a robust framework. The research reported in (Das et al., 2017; Mostafazadeh et al., 2017; De Vries et al., 2017; Gan et al., 2019) has been effective in narrowing the gap between vision and language. Similarly, in (Le et al., 2019; Alamri et al., 2018; Lin et al., 2019a), the DSTC7 dataset has been used for response generation by incorporating audio and visual features.…”
Section: Related Work
Confidence: 99%
“…Recently, research in dialogue systems has shifted towards incorporating different modalities, such as images, audio, and video, to capture the information needed to build robust systems. The research reported in [7, 35-38] has been effective in narrowing the gap between vision and language. In [36], an Image Grounded Conversations (IGC) task was proposed, in which natural-sounding conversations were generated about a shared image.…”
Section: Multimodal Dialogue Systems
Confidence: 99%