2016
DOI: 10.1007/s11263-016-0966-6
VQA: Visual Question Answering

Abstract: We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more…

Cited by 306 publications (195 citation statements)
References 39 publications
“…respectively. 3 We begin from a seed set of 250 manually constructed patterns, and extend it with 274 natural patterns derived from VQA1.0 [4] through templatization of words from our ontology. 4 To increase the question diversity, apart from using synonyms for objects and attributes, we incorporate probabilistic sections into the patterns, such as optional phrases [x] and alternate expressions (x|y), which get instantiated at random.…”
Section: The Question Engine
mentioning
confidence: 99%
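The excerpt above describes instantiating question patterns that contain optional phrases `[x]` and alternate expressions `(x|y)` at random. A minimal sketch of such an instantiation step is below; the function name and example pattern are hypothetical, not taken from the cited work:

```python
import random
import re

def instantiate(pattern, rng=random):
    """Randomly instantiate one question pattern.

    Optional phrases are written as [x] (independently kept or dropped);
    alternate expressions as (x|y) (one option chosen at random).
    """
    # Resolve each optional phrase [x]: keep it or drop it with equal probability.
    pattern = re.sub(r"\[([^\]]*)\]",
                     lambda m: m.group(1) if rng.random() < 0.5 else "",
                     pattern)
    # Resolve each alternation (x|y|...): pick one option uniformly.
    pattern = re.sub(r"\(([^)|]*(?:\|[^)|]*)+)\)",
                     lambda m: rng.choice(m.group(1).split("|")),
                     pattern)
    # Collapse the extra whitespace left behind by dropped phrases.
    return re.sub(r"\s+", " ", pattern).strip()

# Hypothetical pattern for illustration:
print(instantiate("(What|Which) color is the [small] ball?"))
```

Each call yields one of the four possible surface forms, so repeated sampling over many patterns increases question diversity, as the excerpt notes.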
“…For VQA1.0, blind models achieve 50% in accuracy without even considering the images whatsoever [4]. Similarly, for VQA2.0, 67% and 27% of the binary and open questions respectively are answered correctly by such models [11].…”
mentioning
confidence: 94%
“…To further investigate the relevance of our findings to biological visual systems, in follow-up work we intend to deploy our modulation scheme on architectures that bear more similarity to the primate visual hierarchy, such as deep convolutional networks (Kriegeskorte, 2015), datasets of naturalistic images such as ImageNet (Russakovsky et al, 2015), and general naturalistic tasks such as visual question answering (Agrawal et al, 2017). This will allow us to assess whether the functional advantage provided by early modulation holds true in a more realistic scenario, and whether the resulting modulation schemes resemble those observed in the early visual areas of the primate brain.…”
Section: Discussion
mentioning
confidence: 99%
“…Next, we verify the applicability of the 3-D scene graph by demonstrating two major applications: 1) visual question answering (VQA) and 2) task planning. The two applications are under active research in the computer vision [5], NLP [6], and robotics [7] communities.…”
mentioning
confidence: 99%