2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.540
Visual7W: Grounded Question Answering in Images

Abstract: We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local …

Cited by 654 publications (591 citation statements) · References 43 publications
“…The VQA dataset [1], among widely used benchmarks, is a collection of diverse free-form, open-ended questions. Visual7w [12] is a dataset that aims to provide semantic links between textual descriptions and image regions by means of object-level grounding. FVQA [13] primarily contains questions that require external information to answer.…”
Section: Related Work, A. VQA Datasets (mentioning)
confidence: 99%
“…It is worth noting that the work of [23] also used the questions from the VQA dataset [1] for training purposes, whereas the work by [38] uses only the VQG-COCO dataset. We understand that the size of this dataset is small and that there are other datasets, such as VQA [1], Visual7W [66] and Visual Genome [29], which have thousands of images and questions. But VQA questions are mainly visually grounded and literal, Visual7W questions are designed to be answerable from the image alone, and questions in Visual Genome focus on cognitive tasks, making them unnatural for asking a human [38] and hence not suited for the VQG task.…”
Section: Dataset (mentioning)
confidence: 99%
“…The outputs of these two LSTMs are then fed to a fully connected layer to predict the question. In Zhu et al. (2015) the model instead learns which region of the image to attend to, rather than being fed any specific region of the image. Here the LSTM is fed the CNN feature of the whole image and the question word by word.…”
Section: Literature Review (mentioning)
confidence: 99%
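
The statement above summarizes the attention mechanism of the Visual7W model: a single LSTM consumes the whole-image CNN feature and then the question word by word, while the model learns which image regions to attend to. The following PyTorch sketch only illustrates that idea under assumed dimensions; the class and layer names (AttentiveQALSTM, img_proj, att_score) are hypothetical and this is not the authors' implementation.

# Minimal sketch (not the authors' code): an LSTM reads the whole-image CNN
# feature followed by the question words, and soft attention weights a set of
# region features against the final LSTM state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveQALSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, cnn_dim=4096, region_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(cnn_dim, embed_dim)        # whole-image CNN feature -> LSTM input
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.att_region = nn.Linear(region_dim, hidden_dim)  # project region features
        self.att_score = nn.Linear(hidden_dim, 1)            # scalar attention score per region

    def forward(self, image_feat, question_tokens, region_feats):
        # image_feat: (B, cnn_dim), question_tokens: (B, T), region_feats: (B, R, region_dim)
        img_in = self.img_proj(image_feat).unsqueeze(1)      # (B, 1, embed_dim)
        word_in = self.embed(question_tokens)                # (B, T, embed_dim)
        seq = torch.cat([img_in, word_in], dim=1)            # image first, then the question word by word
        _, (h, _) = self.lstm(seq)                           # final hidden state: (1, B, hidden_dim)
        h = h.squeeze(0)
        # Soft attention: score each region against the LSTM state, normalize, and pool.
        scores = self.att_score(torch.tanh(self.att_region(region_feats) + h.unsqueeze(1)))
        weights = F.softmax(scores, dim=1)                   # (B, R, 1), one weight per region
        attended = (weights * region_feats).sum(dim=1)       # (B, region_dim)
        return attended, weights

# Toy usage with random tensors (14x14 grid of region features):
model = AttentiveQALSTM(vocab_size=10000)
attended, weights = model(torch.randn(2, 4096),
                          torch.randint(0, 10000, (2, 8)),
                          torch.randn(2, 14 * 14, 512))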
“…We conducted our experiments on the Visual7W dataset, which was introduced by Zhu et al. (2015). Visual7W is named after the seven categories of questions it contains: What, Where, How, When, Who, Why, and Which.…”
Section: Dataset (mentioning)
confidence: 99%
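
Since the quote above characterizes Visual7W by the 7W word that opens each question, a small helper like the sketch below can bucket questions into those seven categories. The function and variable names are illustrative and are not part of the dataset's official tooling.

# Hypothetical helper: classify a question by its leading 7W word.
from collections import Counter

SEVEN_W = ("what", "where", "how", "when", "who", "why", "which")

def question_category(question: str) -> str:
    """Return the 7W category of a question, or 'other' if it starts differently."""
    first = question.strip().lower().split()[0] if question.strip() else ""
    return first if first in SEVEN_W else "other"

questions = [
    "What is the man holding?",
    "Which boat is closest to the dock?",
    "Why is the street wet?",
]
print(Counter(question_category(q) for q in questions))
# Counter({'what': 1, 'which': 1, 'why': 1})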