Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1348

Generating Question Relevant Captions to Aid Visual Question Answering

Abstract: Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrates th…
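
The abstract's central mechanism is the online, gradient-based selection of question-relevant captions from an existing caption dataset. The sketch below is a minimal illustration of that idea, not the authors' actual architecture: a candidate caption is scored by how well the gradient of a (toy) captioning loss on a shared image representation agrees with the gradient of the VQA answer loss on the same representation. The dimensions, linear heads, and candidate-caption setup are all assumptions made for illustration.

```python
# Hedged sketch: gradient-agreement scoring of candidate captions.
# Everything here (feature size D, the two linear heads, 5 candidate
# captions, answer index 3) is an illustrative assumption.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D = 16                                        # assumed size of the shared image representation
img = torch.randn(D, requires_grad=True)      # pooled image features (stand-in)
vqa_head = torch.nn.Linear(D, 10)             # toy answer classifier (10 candidate answers)
cap_head = torch.nn.Linear(D, 5)              # toy caption scorer over 5 candidate captions

answer = torch.tensor([3])                    # ground-truth answer index (made up)

# Gradient of the VQA loss with respect to the shared image features.
vqa_loss = F.cross_entropy(vqa_head(img).unsqueeze(0), answer)
(g_vqa,) = torch.autograd.grad(vqa_loss, img)

# Score each candidate caption by how well its training signal on the image
# features agrees with the VQA signal (cosine similarity of the two gradients).
scores = []
for cap_idx in range(5):
    img_copy = img.detach().clone().requires_grad_(True)
    cap_loss = F.cross_entropy(cap_head(img_copy).unsqueeze(0),
                               torch.tensor([cap_idx]))
    (g_cap,) = torch.autograd.grad(cap_loss, img_copy)
    scores.append(F.cosine_similarity(g_vqa, g_cap, dim=0).item())

best = max(range(len(scores)), key=scores.__getitem__)
print("relevance scores:", [round(s, 3) for s in scores])
print("caption selected as question-relevant:", best)
```

In the paper the relevance decision is made online during joint training of the captioning and VQA objectives; this sketch only isolates the gradient-agreement scoring step under the stated assumptions.
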

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1
1

Citation Types

0
25
0

Year Published

2020
2020
2024
2024

Publication Types

Select...
4
3
3

Relationship

0
10

Authors

Journals

Cited by 40 publications (25 citation statements). References 36 publications.

“…Another NLP model to produce explanations about an image is tackled by the problem of visual question answering [32], especially useful for the blind or for image captioning projects. Generating questions that can be answered by a DNN's output caption can improve the explainability and quality of image captioning models [33]. RQ 3.…”
Section: Visual Question Answering Models (mentioning)
confidence: 99%

“…This may be because the visual ImageNet feature struggles to learn models for answering textual questions. Furthermore, the concepts encode richer information, in that the words have fewer structural constraints and can easily include the attributes of, and relations among, multiple objects, as revealed in [47]. STAGE [27] is equipped with additional temporal supervision and a combination of global and local features, yet achieves lower results than ours.…”
Section: Models (mentioning)
confidence: 99%

“…To overcome these shortcomings, we propose to augment a deep learning architecture that utilizes neural attention with additional, external knowledge about the image. This type of approach has been used in the past (You et al. 2016; Wu et al. 2016; Kim and Bansal 2019; Wu, Hu, and Mooney 2019); however, our work seeks to take advantage of a different form of knowledge. Our resulting network, which we call the Visual Question Answering-Contextual Information network (VQA-CoIn), improves upon past work by extending it to incorporate semantic information extracted from every region of an image via image descriptions.…”
Section: Introduction (mentioning)
confidence: 99%