Yin and Yang: Balancing and Answering Binary Visual Questions

Zhang, Peng; Goyal, Yash; Summers-Stay, Douglas; Batra, Dhruv; Parikh, Devi

doi:10.1109/cvpr.2016.542

Cited by 273 publications

(250 citation statements)

References 26 publications

Supporting

Mentioning

238

Contrasting

“…A number of recent works have proposed visual question answering datasets [3,22,26,31,10,46,38,36] and models [9,25,2,43,24,27,47,45,44,41,35,20,29,15,42,33,17]. Our work builds on top of the VQA dataset from Antol et al [3], which is one of the most widely used VQA datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Goyal, Khot, Summers-Stay, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.

We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset [3] by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0).

We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners.

Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counterexample based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.

* The first two authors contributed equally.

“…But recent works [6,47,49,16,18,1] have pointed out that language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content.…”

Section: Introduction (mentioning)

confidence: 99%

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Goyal, Khot, Summers-Stay, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

“…possible K answers and multiple-choice picks the answer that has the highest activation from the potential answers. [56]. The accuracy of our best model (deeper LSTM Q + norm I (Fig.…”

Section: Methods (mentioning)

confidence: 99%

VQA: Visual Question Answering

et al. 2016

Self Cite

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).

“…Humans are experts in communicating the reasoning process behind their answer to visual questions. For instance, on typical Visual Question Answering (VQA) samples [1,42,13], human annotators are often able to convincingly justify, in natural language, the reason behind their answer to a certain visual question using simple common sense reasoning. In contrast, deep Learning models are often viewed as black box predictors lacking interpretability in the sense that existing tools often fail to explain the decision process behind the models predictions.…”

Section: Introduction (mentioning)

confidence: 99%

Assisting human experts in the interpretation of their visual process: A case study on assessing copper surface adhesive potency

Hascoet, Deng, Tai, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Deep Neural Networks are often thought to lack interpretability due to the distributed nature of their internal representations. In contrast, humans can generally justify, in natural language, for their answer to a visual question with simple common sense reasoning. However, human introspection abilities have their own limits as one often struggles to justify for the recognition process behind our lowest level feature recognition ability: for instance, it is difficult to precisely explain why a given texture seems more characteristic of the surface of a finger nail rather than a plastic bottle. In this paper, we showcase an application in which deep learning models can actually help human experts justify for their own low-level visual recognition process: We study the problem of assessing the adhesive potency of copper sheets from microscopic pictures of their surface. Although highly trained material experts are able to qualitatively assess the surface adhesive potency, they are often unable to precisely justify for their decision process. We present a model that, under careful design considerations, is able to provide visual clues for human experts to understand and justify for their own recognition process. Not only can our model assist human experts in their interpretation of the surface characteristics, we show how this model can be used to test different hypotheses of the copper surface response to different manufacturing processes.
