“…The task of visual question answering (VQA) relates visual concepts with elements of language and, occasionally, common-sense or general knowledge. Examples of training questions and their correct answer from the VQA v2 dataset [14].…”
Section: What Is On the Coffee Table? What Color Is The Hydrant? Ca... (mentioning)
confidence: 99%
“…on the VQA v2 benchmark [14]. Admittedly, a large part of such a search is necessarily guided by empirical exploration and validation.…”
Section: What Is On the Coffee Table? What Color Is The Hydrant? Ca... (mentioning)
confidence: 99%
“…Datasets A number of large-scale datasets for VQA have been created (e.g. [6,14,24,38]; see [33] for a survey). Each dataset contains various images, typically from Flickr and/or from the COCO dataset [25], together with human-proposed questions and ground truth answers.…”
Section: Introduction (mentioning)
confidence: 99%
“…[6] has served as the de facto benchmark since its introduction in 2015. As the performance of methods improved, however, it became apparent that language-based priors and rote-learning of example questions/answers were overly effective ways to obtain good performance [14,18,37]. That fact hinders the effective evaluation and comparison of competing methods.…”
Section: Introduction (mentioning)
confidence: 99%
“…That fact hinders the effective evaluation and comparison of competing methods. The observation led to the introduction of a new version of the dataset, referred to as VQA v2 [14]. It associates two images to every question.…”
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge. VQA is a task of significant importance for research in artificial intelligence, given its multimodal nature, clear evaluation protocol, and potential real-world applications. The performance of deep neural networks for VQA is very dependent on choices of architectures and hyperparameters. To help further research in the area, we describe in detail our high-performing, though relatively simple model. Through a massive exploration of architectures and hyperparameters representing more than 3,000 GPU-hours, we identified tips and tricks that lead to its success, namely: sigmoid outputs, soft training targets, image features from bottom-up attention, gated tanh activations, output embeddings initialized using GloVe and Google Images, large mini-batches, and smart shuffling of training data. We provide a detailed analysis of their impact on performance to assist others in making an appropriate selection.
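Several of the ingredients listed in the abstract (sigmoid outputs, soft training targets, gated tanh activations) can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration of how these pieces might fit together, assuming hypothetical layer sizes (fused_dim, num_answers) and placeholder tensors; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTanh(nn.Module):
    """Gated tanh layer: y = tanh(W x + b) * sigmoid(W' x + b')."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))

# Assumed sizes, for illustration only.
fused_dim, num_answers, batch = 512, 3000, 32

# Classifier over candidate answers: gated tanh layer followed by a linear map.
classifier = nn.Sequential(GatedTanh(fused_dim, fused_dim),
                           nn.Linear(fused_dim, num_answers))

fused = torch.randn(batch, fused_dim)          # placeholder joint image/question features
soft_targets = torch.rand(batch, num_answers)  # e.g. fraction of annotators giving each answer

logits = classifier(fused)
# Sigmoid outputs trained against soft targets amount to per-answer binary cross-entropy,
# rather than a single softmax label per question.
loss = F.binary_cross_entropy_with_logits(logits, soft_targets)
loss.backward()
```

Replacing the usual softmax cross-entropy with per-answer binary cross-entropy lets the model exploit the multiple ground-truth answers collected per question, which is what the soft targets encode.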
The predominant approach to Visual Question Answering (VQA) demands that the model represents within its weights all of the information required to answer any question about any image. Learning this information from any real training set seems unlikely, and representing it in a reasonable number of weights doubly so. We propose instead to approach VQA as a meta learning task, thus separating the question answering method from the information required. At test time, the method is provided with a support set of example questions/answers, over which it reasons to resolve the given question. The support set is not fixed and can be extended without retraining, thereby expanding the capabilities of the model. To exploit this dynamically provided information, we adapt a state-of-the-art VQA model with two techniques from the recent meta learning literature, namely prototypical networks and meta networks. Experiments demonstrate the capability of the system to learn to produce completely novel answers (i.e. never seen during training) from examples provided at test time. In comparison to the existing state of the art, the proposed method produces qualitatively distinct results with higher recall of rare answers, and a better sample efficiency that allows training with little initial data. More importantly, it represents an important step towards vision-and-language methods that can learn and reason on-the-fly.
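To make the support-set idea concrete, the following Python sketch shows a prototypical-network style scoring step over a dynamically provided support set. The function name, embedding dimension, and random toy data are illustrative assumptions; they do not reproduce the paper's full model, which also incorporates meta networks.

```python
import torch
import torch.nn.functional as F

def prototypical_scores(query_emb, support_embs, support_labels, num_answers):
    """Score a query against answer prototypes built from a support set.

    query_emb:      (d,) embedding of the image/question pair to answer
    support_embs:   (n, d) embeddings of the support examples
    support_labels: (n,) answer index of each support example
    Returns a (num_answers,) score vector; answers absent from the support set get -inf.
    """
    scores = torch.full((num_answers,), float("-inf"))
    for a in range(num_answers):
        mask = support_labels == a
        if mask.any():
            prototype = support_embs[mask].mean(dim=0)        # class mean embedding
            # Negative squared Euclidean distance, as in prototypical networks.
            scores[a] = -torch.sum((query_emb - prototype) ** 2)
    return scores

# Toy usage with random embeddings (real embeddings would come from the VQA model).
torch.manual_seed(0)
support_embs = torch.randn(10, 64)
support_labels = torch.randint(0, 5, (10,))
query_emb = torch.randn(64)
probs = F.softmax(prototypical_scores(query_emb, support_embs, support_labels, num_answers=5), dim=0)
```

Because prototypes are recomputed from whatever support set is supplied at test time, answers never seen during training can still be scored, which is the behaviour the abstract describes.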
The study of algorithms to automatically answer visual questions is currently motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. VizWiz differs from the many existing VQA datasets because (1) images are captured by blind photographers and so are often poor quality, (2) questions are spoken and so are more conversational, and (3) often visual questions cannot be answered. Evaluation of modern algorithms for answering visual questions and deciding if a visual question is answerable reveals that VizWiz is a challenging dataset. We introduce this dataset to encourage a larger community to develop more generalized algorithms that can assist blind people.