Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Year: 2018
DOI: 10.1007/s11263-018-1116-0

Abstract: Problems at the intersection of vision and language are of significant importance, both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to provide a simpler signal for learning than visual modalities, resulting in models that ignore visual information and an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the…

Cited by 419 publications (750 citation statements). References 49 publications.
“…(3) We execute extensive ablation studies for each component of QBN and achieve state-of-the-art performance on VQA v2.0 [6]. Surprisingly, our proposed QBN can even surpass BERT retrained models like VilBERT.…”
Section: MCAN Interaction
Mentioning confidence: 99%
“…VQA consisting of open-ended questions and both real and abstract scenes [44], [234]. A VQA Challenge based on these data sets is held annually as a CVPR workshop since 2016.…”
Section: Visual Question Answering, 1) Task Definition
Mentioning confidence: 99%
“…Specifically, VQA takes an image and a corresponding natural language question as input and outputs the answer. It is a classification problem in which candidate answers are restricted to the most common answers appearing in the dataset and requires deep analysis and understanding of images and questions such as image recognition and object localization [16,27,38,42]. Current models can be classified into three main categories: early fusion models, later fusion models, and external knowledge-based models.…”
Section: Related Work
Mentioning confidence: 99%
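The excerpt above describes VQA as a classification problem: an image and a question are encoded separately, fused, and classified over the most common candidate answers in the dataset. A minimal sketch of that "later fusion" formulation, with random stand-ins for the real feature extractors and an assumed toy answer vocabulary:

```python
import numpy as np

# Toy sketch of VQA as classification over candidate answers.
# The encoders and answer list below are hypothetical stand-ins,
# not the actual models or vocabulary from any cited paper.

rng = np.random.default_rng(0)
ANSWERS = ["yes", "no", "2", "red"]  # assumed top-K answer vocabulary
D = 8                                # feature dimension

def encode_image(image):
    # Stand-in for a CNN image feature extractor.
    return rng.standard_normal(D)

def encode_question(question):
    # Stand-in for an RNN/transformer question encoder.
    return rng.standard_normal(D)

# Linear classifier over the K candidate answers (later fusion:
# modalities are encoded independently, then combined).
W = rng.standard_normal((len(ANSWERS), 2 * D))

def vqa_answer(image, question):
    fused = np.concatenate([encode_image(image), encode_question(question)])
    logits = W @ fused
    return ANSWERS[int(np.argmax(logits))]

print(vqa_answer("img.jpg", "What color is the ball?") in ANSWERS)  # True
```

Restricting outputs to a fixed answer vocabulary is what lets language priors dominate: a model can score well by predicting the most frequent answer for a question type while ignoring the image, which is exactly the failure mode the paper targets.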