2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00522

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

Abstract: A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (…

Cited by 465 publications (719 citation statements)
References 25 publications
“…A similar phenomenon was observed in reading comprehension, where systems performed non-trivially well by using only the final sentence in the passage or ignoring the passage altogether (Kaushik & Lipton, 2018). Finally, multiple studies found nontrivial performance in visual question answering (VQA) by using only the question, without access to the image, due to question biases (Kafle & Kanan, 2016, 2017; Goyal et al., 2017; Agrawal et al., 2017).…”
Section: Fine-tuning On Target Datasets (mentioning)
confidence: 58%
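The statement above refers to question-only ("blind") baselines that exploit answer priors in VQA training data. Below is a minimal illustrative sketch, not any cited paper's model: it predicts the most frequent training answer for each crude question type (here, simply the first two words of the question), ignoring the image entirely. The toy data, the question_type heuristic, and the function names are all assumptions for illustration.

```python
# Minimal sketch of a "blind" VQA baseline: answer from question-conditioned
# answer priors alone, never looking at the image. Toy data is illustrative only.
from collections import Counter, defaultdict

def question_type(question: str, n_words: int = 2) -> str:
    """Crude question-type key: the first few words of the question."""
    return " ".join(question.lower().split()[:n_words])

def fit_prior_baseline(train_qa):
    """Map each question type to its most frequent training answer."""
    answers_by_type = defaultdict(Counter)
    for question, answer in train_qa:
        answers_by_type[question_type(question)][answer] += 1
    return {t: c.most_common(1)[0][0] for t, c in answers_by_type.items()}

def predict(baseline, question, fallback="yes"):
    """Answer without the image; fall back to a common answer for unseen types."""
    return baseline.get(question_type(question), fallback)

if __name__ == "__main__":
    train = [
        ("What color is the banana?", "yellow"),
        ("What color is the sky?", "blue"),
        ("What color is the grass?", "green"),
        ("Is the man smiling?", "yes"),
        ("Is the door open?", "yes"),
    ]
    test = [("What color is the lemon?", "yellow"), ("Is the cat sleeping?", "no")]
    baseline = fit_prior_baseline(train)
    correct = sum(predict(baseline, q) == a for q, a in test)
    print(f"blind-baseline accuracy on toy test set: {correct}/{len(test)}")
```

Even this trivial prior-based predictor scores above chance whenever test-time answer priors match those of training, which is precisely the loophole the VQA-CP splits are designed to close by changing the priors between train and test.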
“…It could also be interesting to extend this generic approach to other forms of captioning such as visual storytelling [38] or stylized captioning [56] by utilizing the easily available and weakly labelled data from the web.…”
Section: Results (mentioning)
confidence: 99%
“…In [11], the authors define that two captions are different if the ratio of common words between them is smaller than a threshold (3% is used in the paper). In [3], from the set of all the candidate captions, the authors compute the number of unique n-grams (n = 1, 2, 4) at each position, starting from the beginning up to position 13. We plot diversity using [11] in Figure 5d.…”
Section: Diversity (mentioning)
confidence: 99%
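Since the quoted passage describes two concrete diversity measures, a small hedged sketch may help. The tokenization, the choice of denominator in the word-overlap ratio, and all function names are assumptions; only the 3% threshold, the n-gram sizes (1, 2, 4), and the position cutoff of 13 are taken from the statement.

```python
# Sketch of the two caption-diversity measures described above (assumed details noted inline).
from itertools import combinations

def common_word_ratio(cap_a: str, cap_b: str) -> float:
    """Shared unique words relative to the smaller caption's vocabulary (denominator is an assumption)."""
    words_a, words_b = set(cap_a.lower().split()), set(cap_b.lower().split())
    return len(words_a & words_b) / max(1, min(len(words_a), len(words_b)))

def are_different(cap_a: str, cap_b: str, threshold: float = 0.03) -> bool:
    """Per [11]'s rule: captions count as different only if their common-word ratio is below the threshold."""
    return common_word_ratio(cap_a, cap_b) < threshold

def unique_ngrams_at_positions(captions, n=1, max_pos=13):
    """Per [3]'s idea: number of distinct n-grams starting at each position across all candidate captions."""
    counts = []
    for pos in range(max_pos):
        grams = set()
        for cap in captions:
            tokens = cap.lower().split()
            if pos + n <= len(tokens):
                grams.add(tuple(tokens[pos:pos + n]))
        counts.append(len(grams))
    return counts

if __name__ == "__main__":
    caps = [
        "a dog runs on the beach",
        "a puppy plays in the sand",
        "two people walk along the shore",
    ]
    # With a 3% threshold, captions sharing even a couple of words are treated as the same.
    print([are_different(a, b) for a, b in combinations(caps, 2)])
    print(unique_ngrams_at_positions(caps, n=2, max_pos=5))
```

The design choice worth noting is that the threshold-based measure judges pairwise distinctness, while the positional n-gram count summarizes diversity of the whole candidate set at once; the two are complementary rather than interchangeable.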
“…As a consequence, even a "blind" model can achieve satisfying results without truly understanding the questions and images. Many efforts, such as building more balanced datasets [120], [121] and enforcing more transparent model designs, have been made to alleviate this issue. Multi-modal fusion.…”
Section: B Exemplar Applications Of Data and Knowledge Fusion (mentioning)
confidence: 99%