2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00437

Why Does a Visual Question Have Different Answers?

Abstract: Visual question answering is the task of returning the answer to a question about an image. A challenge is that different people often provide different answers to the same visual question. To our knowledge, this is the first work that aims to understand why. We propose a taxonomy of nine plausible reasons, and create two labelled datasets consisting of ∼45,000 visual questions indicating which reasons led to answer differences. We then propose a novel problem of predicting directly from a visual question which reasons will lead to answer differences.
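The prediction problem the abstract introduces, inferring from an image/question pair which of the nine reasons will cause people's answers to differ, is naturally framed as multi-label classification, since several reasons can apply to one visual question. Below is a minimal sketch of that framing; it is not the authors' released model, and the feature dimensions, module names, and fusion scheme are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code): predicting which of the nine
# reasons for answer differences apply to a given visual question, framed as
# multi-label classification. Dimensions and names below are assumptions.
import torch
import torch.nn as nn

NUM_REASONS = 9  # the paper's taxonomy of nine plausible reasons

class ReasonPredictor(nn.Module):
    """Maps fused image and question features to one logit per reason."""
    def __init__(self, img_dim=2048, q_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + q_dim, hidden),  # simple concat-then-MLP fusion (assumed)
            nn.ReLU(),
            nn.Linear(hidden, NUM_REASONS),      # one logit per reason
        )

    def forward(self, img_feat, q_feat):
        return self.fuse(torch.cat([img_feat, q_feat], dim=-1))

# Each reason is an independent binary target, so BCEWithLogitsLoss fits the
# multi-label setting (unlike softmax cross-entropy, which assumes one label).
model = ReasonPredictor()
criterion = nn.BCEWithLogitsLoss()
img_feat = torch.randn(4, 2048)  # e.g., pooled CNN image features (assumed)
q_feat = torch.randn(4, 768)     # e.g., a sentence embedding of the question (assumed)
labels = torch.randint(0, 2, (4, NUM_REASONS)).float()
loss = criterion(model(img_feat, q_feat), labels)
loss.backward()
```

At inference time, applying a sigmoid to the logits yields a per-reason probability that can be thresholded independently for each of the nine reasons.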

Cited by 40 publications (58 citation statements): 3 supporting, 53 mentioning, 0 contrasting.
References 39 publications.
“…On a higher level, it is important to note that what makes a "good" caption for users who are blind may not be ultimately deemed synonymous with what is deemed a "good" caption for AI researchers (as discussed in section 2.3). Our findings support previous work indicating that workers' subjective, different interpretations may be desirable [8,10,25,42,78,101,103].…”
Section: Considering the Trade-offs of Open-ended Captioning Tasks (supporting)
confidence: 91%
“…Driven by VQA, several datasets have been proposed to minimize the bias observed in natural images (Goyal et al., 2017; Ray et al., 2019); to force models to "reason" over a joint visual and linguistic input (Suhr et al., 2019); to deal with objects' attributes and relations (Krishna et al., 2017); to encompass more diverse (Zhu et al., 2016) and goal-oriented questions and answers (Gurari et al., 2018). At the same time, some work proposed higher-level evaluations of VQA models and showed their limitations (Hodosh and Hockenmaier, 2016; Shekhar et al., 2017); similarly, recent attention has been paid to understand what makes a question "difficult" for a model (Bhattacharya et al., 2019; Terao et al., 2020). Despite impressive progress, current approaches to VQA do not tackle one crucial limitation of the task: the answer to a question is given by the alignment of language and vision rather than their complementary integration.…”
Section: Related Work (mentioning)
confidence: 99%
“…Other studies that have addressed the issue of unanswerable visual questions include Toor et al. (2017) and Bhattacharya et al. (2019).…”
Section: Unanswerable Questions in Images (mentioning)
confidence: 99%