Driven by VQA, several datasets have been proposed to minimize the bias observed in natural images (Goyal et al., 2017; Ray et al., 2019); to force models to "reason" over a joint visual and linguistic input (Suhr et al., 2019); to deal with objects' attributes and relations (Krishna et al., 2017); and to encompass more diverse (Zhu et al., 2016) and goal-oriented questions and answers (Gurari et al., 2018). At the same time, some work has proposed higher-level evaluations of VQA models and shown their limitations (Hodosh and Hockenmaier, 2016; Shekhar et al., 2017); similarly, recent attention has been paid to understanding what makes a question "difficult" for a model (Bhattacharya et al., 2019; Terao et al., 2020). Despite impressive progress, current approaches to VQA do not tackle one crucial limitation of the task: the answer to a question is given by the alignment of language and vision rather than by their complementary integration.