Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1209

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Abstract: Visual dialog (VisDial) is a task that requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the agent must capture temporal context from the dialog history and use visually grounded information. Visual reference resolution addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to locate those references in the given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. …
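As a rough illustration of the visual reference resolution setting described in the abstract, the sketch below (plain NumPy, not the authors' DAN implementation) runs two attention steps: one over dialog-history embeddings to resolve an ambiguous question, and one over image-region features using the resolved question. All names, shapes, and inputs are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention: query (d,), keys/values (n, d).
    scores = keys @ query / np.sqrt(query.shape[-1])  # (n,) similarity scores
    weights = softmax(scores)                         # attention distribution
    return weights @ values, weights                  # weighted sum, weights

# Hypothetical inputs: a question embedding, H dialog-history embeddings,
# and R image-region features, all of dimension d.
d, H, R = 64, 10, 36
rng = np.random.default_rng(0)
q = rng.normal(size=d)              # encoding of an ambiguous question, e.g. "Is he holding it?"
history = rng.normal(size=(H, d))   # one embedding per past question-answer pair
regions = rng.normal(size=(R, d))   # image-region features (e.g. from a detector)

# Step 1: attend over the dialog history to resolve the ambiguous reference.
context, hist_weights = attend(q, history, history)
resolved_q = q + context            # question enriched with the referred-to context

# Step 2: attend over image regions with the resolved question to ground it visually.
visual_context, region_weights = attend(resolved_q, regions, regions)
print(hist_weights.round(2), int(region_weights.argmax()))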

Cited by 69 publications (46 citation statements) · References 19 publications
“…In other words, for the visual coreference resolution problem, the models of [5, 9] operate a separate attention memory network at the sentence level and word level, respectively. In contrast, the models of [8, 6] adopt visual coreference resolution methods that first find the object indicated by a pronoun in the new natural-language question within the past dialog history before finding the corresponding visual attention map. However, existing studies on visual dialog, including [5], have not attempted to process impersonal pronouns separately from general pronouns, unlike the model proposed in this study.…”
Section: Related Work
Mentioning, confidence: 99%
“…The existing models for visual dialog have mostly been implemented as a single large monolithic neural network [3, 4, 5, 6, 7, 8, 9, 10, 11]. However, VQA and visual dialog are composable in nature, in that the process of generating an answer to a natural-language question can be completed by composing multiple basic neural network modules.…”
Section: Introduction
Mentioning, confidence: 99%
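The following is a toy sketch, not any cited model: it illustrates the composability claim in the statement above by chaining two small stand-in "modules", find and describe, to go from a question encoding to an attended answer feature. All functions, names, and shapes here are hypothetical.

from typing import Callable, List
import numpy as np

def find(regions: np.ndarray, query: np.ndarray) -> np.ndarray:
    # Stand-in "find" module: softmax-normalized relevance of each region to the query.
    scores = regions @ query
    e = np.exp(scores - scores.max())
    return e / e.sum()

def describe(regions: np.ndarray, attention: np.ndarray) -> np.ndarray:
    # Stand-in "describe" module: summarize the attended regions into one feature vector.
    return attention @ regions

def compose(modules: List[Callable], regions: np.ndarray, query: np.ndarray) -> np.ndarray:
    # Chain the modules: the first produces an attention map, the second consumes it.
    attention = modules[0](regions, query)
    return modules[1](regions, attention)

rng = np.random.default_rng(1)
regions = rng.normal(size=(36, 64))   # hypothetical image-region features
query = rng.normal(size=64)           # hypothetical question encoding
answer_feature = compose([find, describe], regions, query)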
“…We also include some concurrent work on visual dialog that has not been discussed above, including the image-question-answer synergistic network (Guo et al., 2019), recursive visual attention (Niu et al., 2018), factor graph attention (Schwartz et al., 2019), dual attention network (Kang et al., 2019), graph neural network, history-advantage sequence training (Yang et al., 2019), and weighted likelihood estimation.…”
Section: Concurrent Work
Mentioning, confidence: 99%
“…Recent years have witnessed increasing attention to visually grounded dialogues (Zarrieß et al., 2016; de Vries et al., 2018; Alamri et al., 2019; Narayan-Chen et al., 2019). Despite the impressive progress on benchmark scores and model architectures (Das et al., 2017b; Wu et al., 2018; Kottur et al., 2018; Gan et al., 2019; Shukla et al., 2019; Niu et al., 2019; Zheng et al., 2019; Kang et al., 2019; Murahari et al., 2019; Pang and Wang, 2020), critical problems have also been pointed out in terms of dataset biases (Goyal et al., 2017; Chattopadhyay et al., 2017; Massiceti et al., 2018; Chen et al., 2018; Kottur et al., 2019; Kim et al., 2020; Agarwal et al., 2020) which obscure such contributions. For instance, Cirik et al. (2018) point out that existing datasets for reference resolution may be largely solvable without recognizing the full referring expressions (e.g.…”
Section: Related Work
Mentioning, confidence: 99%