“…The task of visual question answering (VQA) relates visual concepts with elements of language and, occasionally, common-sense or general knowledge. Examples of training questions and their correct answer from the VQA v2 dataset [14].…”
Section: What Is On the Coffee Table? What Color Is The Hydrant? Ca... (mentioning)
confidence: 99%
“…on the VQA v2 benchmark [14]. Admittedly, a large part of such a search is necessarily guided by empirical exploration and validation.…”
Section: What Is On the Coffee Table? What Color Is The Hydrant? Ca... (mentioning)
confidence: 99%
“…Datasets A number of large-scale datasets for VQA have been created (e.g. [6,14,24,38]; see [33] for a survey). Each dataset contains various images, typically from Flickr and/or from the COCO dataset [25], together with human-proposed questions and ground truth answers.…”
Section: Introduction (mentioning)
confidence: 99%
“…[6] has served as the de facto benchmark since its introduction in 2015. As the performance of methods improved, however, it became apparent that language-based priors and rote-learning of example questions/answers were overly effective ways to obtain good performance [14,18,37]. That fact hinders the effective evaluation and comparison of competing methods.…”
Section: Introduction (mentioning)
confidence: 99%
“…That fact hinders the effective evaluation and comparison of competing methods. The observation led to the introduction of a new version of the dataset, referred to as VQA v2 [14]. It associates two images to every question.…”
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge. VQA is a task of significant importance for research in artificial intelligence, given its multimodal nature, clear evaluation protocol, and potential real-world applications. The performance of deep neural networks for VQA is very dependent on choices of architectures and hyperparameters. To help further research in the area, we describe in detail our high-performing, though relatively simple model. Through a massive exploration of architectures and hyperparameters representing more than 3,000 GPU-hours, we identified tips and tricks that lead to its success, namely: sigmoid outputs, soft training targets, image features from bottom-up attention, gated tanh activations, output embeddings initialized using GloVe and Google Images, large mini-batches, and smart shuffling of training data. We provide a detailed analysis of their impact on performance to assist others in making an appropriate selection.
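Several of the ingredients listed in the abstract (sigmoid outputs, soft training targets, gated tanh activations) can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration of how these pieces might fit together, assuming hypothetical layer sizes (fused_dim, num_answers) and placeholder tensors; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTanh(nn.Module):
    """Gated tanh layer: y = tanh(W x + b) * sigmoid(W' x + b')."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.tanh(self.fc(x)) * torch.sigmoid(self.gate(x))

# Assumed sizes, for illustration only.
fused_dim, num_answers, batch = 512, 3000, 32

# Classifier over candidate answers: gated tanh layer followed by a linear map.
classifier = nn.Sequential(GatedTanh(fused_dim, fused_dim),
                           nn.Linear(fused_dim, num_answers))

fused = torch.randn(batch, fused_dim)          # placeholder joint image/question features
soft_targets = torch.rand(batch, num_answers)  # e.g. fraction of annotators giving each answer

logits = classifier(fused)
# Sigmoid outputs trained against soft targets amount to per-answer binary cross-entropy,
# rather than a single softmax label per question.
loss = F.binary_cross_entropy_with_logits(logits, soft_targets)
loss.backward()
```

Replacing the usual softmax cross-entropy with per-answer binary cross-entropy lets the model exploit the multiple ground-truth answers collected per question, which is what the soft targets encode.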
The predominant approach to Visual Question Answering (VQA) demands that the model represents within its weights all of the information required to answer any question about any image. Learning this information from any real training set seems unlikely, and representing it in a reasonable number of weights doubly so. We propose instead to approach VQA as a meta learning task, thus separating the question answering method from the information required. At test time, the method is provided with a support set of example questions/answers, over which it reasons to resolve the given question. The support set is not fixed and can be extended without retraining, thereby expanding the capabilities of the model. To exploit this dynamically provided information, we adapt a state-of-the-art VQA model with two techniques from the recent meta learning literature, namely prototypical networks and meta networks. Experiments demonstrate the capability of the system to learn to produce completely novel answers (i.e. never seen during training) from examples provided at test time. In comparison to the existing state of the art, the proposed method produces qualitatively distinct results with higher recall of rare answers, and a better sample efficiency that allows training with little initial data. More importantly, it represents an important step towards vision-and-language methods that can learn and reason on-the-fly.
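To make the support-set idea concrete, the following Python sketch shows a prototypical-network style scoring step over a dynamically provided support set. The function name, embedding dimension, and random toy data are illustrative assumptions; they do not reproduce the paper's full model, which also incorporates meta networks.

```python
import torch
import torch.nn.functional as F

def prototypical_scores(query_emb, support_embs, support_labels, num_answers):
    """Score a query against answer prototypes built from a support set.

    query_emb:      (d,) embedding of the image/question pair to answer
    support_embs:   (n, d) embeddings of the support examples
    support_labels: (n,) answer index of each support example
    Returns a (num_answers,) score vector; answers absent from the support set get -inf.
    """
    scores = torch.full((num_answers,), float("-inf"))
    for a in range(num_answers):
        mask = support_labels == a
        if mask.any():
            prototype = support_embs[mask].mean(dim=0)        # class mean embedding
            # Negative squared Euclidean distance, as in prototypical networks.
            scores[a] = -torch.sum((query_emb - prototype) ** 2)
    return scores

# Toy usage with random embeddings (real embeddings would come from the VQA model).
torch.manual_seed(0)
support_embs = torch.randn(10, 64)
support_labels = torch.randint(0, 5, (10,))
query_emb = torch.randn(64)
probs = F.softmax(prototypical_scores(query_emb, support_embs, support_labels, num_answers=5), dim=0)
```

Because prototypes are recomputed from whatever support set is supplied at test time, answers never seen during training can still be scored, which is the behaviour the abstract describes.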
The study of algorithms to automatically answer visual questions is currently motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. VizWiz differs from the many existing VQA datasets because (1) images are captured by blind photographers and so are often poor quality, (2) questions are spoken and so are more conversational, and (3) often visual questions cannot be answered. Evaluation of modern algorithms for answering visual questions and deciding if a visual question is answerable reveals that VizWiz is a challenging dataset. We introduce this dataset to encourage a larger community to develop more generalized algorithms that can assist blind people.