Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1209

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Abstract: Visual dialog (VisDial) is a task that requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the agent must capture temporal context from the dialog history and use visually grounded information. Visual reference resolution addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to locate those references in the given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. …
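As a rough illustration of the visual reference resolution setting described in the abstract, the sketch below (plain NumPy, not the authors' DAN implementation) runs two attention steps: one over dialog-history embeddings to resolve an ambiguous question, and one over image-region features using the resolved question. All names, shapes, and inputs are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention: query (d,), keys/values (n, d).
    scores = keys @ query / np.sqrt(query.shape[-1])  # (n,) similarity scores
    weights = softmax(scores)                         # attention distribution
    return weights @ values, weights                  # weighted sum, weights

# Hypothetical inputs: a question embedding, H dialog-history embeddings,
# and R image-region features, all of dimension d.
d, H, R = 64, 10, 36
rng = np.random.default_rng(0)
q = rng.normal(size=d)              # encoding of an ambiguous question, e.g. "Is he holding it?"
history = rng.normal(size=(H, d))   # one embedding per past question-answer pair
regions = rng.normal(size=(R, d))   # image-region features (e.g. from a detector)

# Step 1: attend over the dialog history to resolve the ambiguous reference.
context, hist_weights = attend(q, history, history)
resolved_q = q + context            # question enriched with the referred-to context

# Step 2: attend over image regions with the resolved question to ground it visually.
visual_context, region_weights = attend(resolved_q, regions, regions)
print(hist_weights.round(2), int(region_weights.argmax()))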

Cited by 69 publications (46 citation statements) · References 19 publications
“…In other words, for the visual coreference resolution problem, the models of [5, 9] operate a separate attention memory network at the sentence level and word level, respectively. In contrast, the models of [8, 6] adopt visual coreference resolution methods that first find the object indicated by a pronoun in the new natural-language question within the past dialog history before finding the corresponding visual attention map. However, existing studies on visual dialog, including [5], have not attempted to process impersonal pronouns separately from general pronouns, unlike the model proposed in this study.…”
Section: Related Work
Mentioning, confidence: 99%
“…The existing models for visual dialog have mostly been implemented as a single large monolithic neural network [3, 4, 5, 6, 7, 8, 9, 10, 11]. However, VQA and visual dialog are composable in nature, in that the process of generating an answer to a natural-language question can be completed by composing multiple basic neural network modules.…”
Section: Introduction
Mentioning, confidence: 99%
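The following is a toy sketch, not any cited model: it illustrates the composability claim in the statement above by chaining two small stand-in "modules", find and describe, to go from a question encoding to an attended answer feature. All functions, names, and shapes here are hypothetical.

from typing import Callable, List
import numpy as np

def find(regions: np.ndarray, query: np.ndarray) -> np.ndarray:
    # Stand-in "find" module: softmax-normalized relevance of each region to the query.
    scores = regions @ query
    e = np.exp(scores - scores.max())
    return e / e.sum()

def describe(regions: np.ndarray, attention: np.ndarray) -> np.ndarray:
    # Stand-in "describe" module: summarize the attended regions into one feature vector.
    return attention @ regions

def compose(modules: List[Callable], regions: np.ndarray, query: np.ndarray) -> np.ndarray:
    # Chain the modules: the first produces an attention map, the second consumes it.
    attention = modules[0](regions, query)
    return modules[1](regions, attention)

rng = np.random.default_rng(1)
regions = rng.normal(size=(36, 64))   # hypothetical image-region features
query = rng.normal(size=64)           # hypothetical question encoding
answer_feature = compose([find, describe], regions, query)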
“…We also include some concurrent work on visual dialog that has not been discussed above, including the image-question-answer synergistic network (Guo et al., 2019), recursive visual attention (Niu et al., 2018), factor graph attention (Schwartz et al., 2019), dual attention network (Kang et al., 2019), graph neural network, history-advantage sequence training (Yang et al., 2019), and weighted likelihood estimation.…”
Section: Concurrent Work
Mentioning, confidence: 99%
“…Recent years have witnessed increasing attention to visually grounded dialogues (Zarrieß et al., 2016; de Vries et al., 2018; Alamri et al., 2019; Narayan-Chen et al., 2019). Despite the impressive progress on benchmark scores and model architectures (Das et al., 2017b; Wu et al., 2018; Kottur et al., 2018; Gan et al., 2019; Shukla et al., 2019; Niu et al., 2019; Zheng et al., 2019; Kang et al., 2019; Murahari et al., 2019; Pang and Wang, 2020), critical problems have also been pointed out in terms of dataset biases (Goyal et al., 2017; Chattopadhyay et al., 2017; Massiceti et al., 2018; Chen et al., 2018; Kottur et al., 2019; Kim et al., 2020; Agarwal et al., 2020) which obscure such contributions. For instance, Cirik et al. (2018) point out that existing datasets for reference resolution may be largely solvable without recognizing the full referring expressions (e.g.…”
Section: Related Work
Mentioning, confidence: 99%