2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.542
Yin and Yang: Balancing and Answering Binary Visual Questions

Abstract: The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of…

Cited by 273 publications (250 citation statements)
References 26 publications
“…A number of recent works have proposed visual question answering datasets [3,22,26,31,10,46,38,36] and models [9,25,2,43,24,27,47,45,44,41,35,20,29,15,42,33,17]. Our work builds on top of the VQA dataset from Antol et al. [3], which is one of the most widely used VQA datasets.…”
Section: Related Work
confidence: 99%
“…But recent works [6,47,49,16,18,1] have pointed out that language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content.…”
Section: Introduction
confidence: 99%
“…possible K answers and multiple-choice picks the answer that has the highest activation from the potential answers. [56]. The accuracy of our best model (deeper LSTM Q + norm I (Fig.…”
Section: Methods
confidence: 99%
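The excerpt above refers to the multiple-choice setting of VQA evaluation, in which the model scores the K possible answers and returns the highest-activation option among the candidates offered for that question. The snippet below is a minimal illustrative sketch of that selection step only; it is not code from the cited paper, and the names pick_multiple_choice_answer, answer_scores, and choices are hypothetical.

```python
# Minimal sketch (not the cited authors' code) of multiple-choice answer
# selection in VQA: among the offered choices, return the answer with the
# highest model activation. All names here are hypothetical.

def pick_multiple_choice_answer(answer_scores, choices):
    """answer_scores: dict mapping each of the K possible answers to the
    model's output activation (e.g. a softmax score).
    choices: the multiple-choice options offered for this question."""
    return max(choices, key=lambda ans: answer_scores.get(ans, float("-inf")))

# Example with made-up scores over a tiny answer vocabulary.
scores = {"yes": 0.71, "no": 0.22, "red": 0.04, "two": 0.03}
print(pick_multiple_choice_answer(scores, ["yes", "no", "two"]))  # -> "yes"
```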
“…Humans are experts in communicating the reasoning process behind their answer to visual questions. For instance, on typical Visual Question Answering (VQA) samples [1,42,13], human annotators are often able to convincingly justify, in natural language, the reason behind their answer to a certain visual question using simple common sense reasoning. In contrast, deep learning models are often viewed as black box predictors lacking interpretability, in the sense that existing tools often fail to explain the decision process behind the models' predictions.…”
Section: Introduction
confidence: 99%