2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.540
Visual7W: Grounded Question Answering in Images

Abstract: We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local …

Cited by 654 publications (591 citation statements) · References 43 publications
“…The VQA dataset [1], among widely used benchmarks, is a collection of diverse free-form, open-ended questions. Visual7w [12] is a dataset that aims to provide semantic links between textual descriptions and image regions by means of object-level grounding. FVQA [13] primarily contains questions that require external information to answer.…”
Section: Related Work, A. VQA Datasets (mentioning)
confidence: 99%
“…It is worth noting that the work of [23] also used the questions from the VQA dataset [1] for training purposes, whereas the work by [38] uses only the VQG-COCO dataset. We understand that the size of this dataset is small and that there are other datasets, such as VQA [1], Visual7W [66] and Visual Genome [29], which have thousands of images and questions. But VQA questions are mainly visually grounded and literal, Visual7W questions are designed to be answerable from the image alone, and questions in Visual Genome focus on cognitive tasks, making them unnatural for asking a human [38] and hence not suited for the VQG task.…”
Section: Dataset (mentioning)
confidence: 99%
“…The outputs of these two LSTMs are then fed to a fully connected layer to predict the question. In Zhu et al. (2015) the model instead learns which region of the image to attend to, rather than being fed any specific region of the image. Here the LSTM is fed the CNN feature of the whole image and the question word by word.…”
Section: Literature Review (mentioning)
confidence: 99%
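
The statement above summarizes the attention mechanism of the Visual7W model: a single LSTM consumes the whole-image CNN feature and then the question word by word, while the model learns which image regions to attend to. The following PyTorch sketch only illustrates that idea under assumed dimensions; the class and layer names (AttentiveQALSTM, img_proj, att_score) are hypothetical and this is not the authors' implementation.

# Minimal sketch (not the authors' code): an LSTM reads the whole-image CNN
# feature followed by the question words, and soft attention weights a set of
# region features against the final LSTM state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveQALSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, cnn_dim=4096, region_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(cnn_dim, embed_dim)        # whole-image CNN feature -> LSTM input
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.att_region = nn.Linear(region_dim, hidden_dim)  # project region features
        self.att_score = nn.Linear(hidden_dim, 1)            # scalar attention score per region

    def forward(self, image_feat, question_tokens, region_feats):
        # image_feat: (B, cnn_dim), question_tokens: (B, T), region_feats: (B, R, region_dim)
        img_in = self.img_proj(image_feat).unsqueeze(1)      # (B, 1, embed_dim)
        word_in = self.embed(question_tokens)                # (B, T, embed_dim)
        seq = torch.cat([img_in, word_in], dim=1)            # image first, then the question word by word
        _, (h, _) = self.lstm(seq)                           # final hidden state: (1, B, hidden_dim)
        h = h.squeeze(0)
        # Soft attention: score each region against the LSTM state, normalize, and pool.
        scores = self.att_score(torch.tanh(self.att_region(region_feats) + h.unsqueeze(1)))
        weights = F.softmax(scores, dim=1)                   # (B, R, 1), one weight per region
        attended = (weights * region_feats).sum(dim=1)       # (B, region_dim)
        return attended, weights

# Toy usage with random tensors (14x14 grid of region features):
model = AttentiveQALSTM(vocab_size=10000)
attended, weights = model(torch.randn(2, 4096),
                          torch.randint(0, 10000, (2, 8)),
                          torch.randn(2, 14 * 14, 512))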
“…We conducted our experiments on the Visual7W dataset, which was introduced by Zhu et al. (2015). Visual7W is named after the seven categories of questions it contains: What, Where, How, When, Who, Why, and Which.…”
Section: Dataset (mentioning)
confidence: 99%
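
Since the quote above characterizes Visual7W by the 7W word that opens each question, a small helper like the sketch below can bucket questions into those seven categories. The function and variable names are illustrative and are not part of the dataset's official tooling.

# Hypothetical helper: classify a question by its leading 7W word.
from collections import Counter

SEVEN_W = ("what", "where", "how", "when", "who", "why", "which")

def question_category(question: str) -> str:
    """Return the 7W category of a question, or 'other' if it starts differently."""
    first = question.strip().lower().split()[0] if question.strip() else ""
    return first if first in SEVEN_W else "other"

questions = [
    "What is the man holding?",
    "Which boat is closest to the dock?",
    "Why is the street wet?",
]
print(Counter(question_category(q) for q in questions))
# Counter({'what': 1, 'which': 1, 'why': 1})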