Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources

Wu, Qi; Wang, Peng; Shen, Chunhua; Dick, Anthony; Hengel, Anton van den

doi:10.1109/cvpr.2016.500

Cited by 299 publications

(226 citation statements)

References 29 publications

Supporting

Mentioning

214

Contrasting

“…A number of recent works have proposed visual question answering datasets [3,22,26,31,10,46,38,36] and models [9,25,2,43,24,27,47,45,44,41,35,20,29,15,42,33,17]. Our work builds on top of the VQA dataset from Antol et al [3], which is one of the most widely used VQA datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Goyal

Khot

Summers-Stay

et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

1,633

1,372

Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset [3] by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counterexample based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users. * The first two authors contributed equally.

“…Specifically, VQA takes an image and a corresponding natural language question as input and outputs the answer. It is a classification problem in which candidate answers are restricted to the most common answers appearing in the dataset and requires deep analysis and understanding of images and questions such as image recognition and object localization [16,27,38,42]. Current models can be classified into three main categories: early fusion models, later fusion models, and external knowledge-based models.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog

Gao

Chen

Liu

et al. 2020

Proceedings of the Web Conference 2020

Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching text labels of stickers with previous utterances. However, due to their large quantities, it is impractical to require text labels for all the stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is to learn the semantic meaning of stickers without corresponding text labels. Another challenge is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolutional based sticker image encoder and a self-attention based multi-turn dialog encoder to obtain the representation of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependency between all interaction results by a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance for all commonly-used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 340K multi-turn dialog and sticker pairs. * Equal contribution. Ordering is decided by a coin flip. Work performed during an internship at IIAI. † WICT is the abbreviation of Wangxuan Institute of Computer Technology.

“…Several researchers employed commonsense knowledge to enrich high-level understanding tasks such as visual ques- [Figure 2: (a) Example of questions that require explicit external knowledge [35], (b) Example where knowledge helps [37]. (c) Ways to integrate background knowledge: i) Pre-process knowledge and augment input [1]; ii) Incorporate knowledge as embeddings [36]; iii) Post-processing using explicit reasoning mechanism [2]; iv) Using knowledge graph to influence NN architecture [24].]…”

Section: High-level Common-sense Knowledge (mentioning)

confidence: 99%

Integrating Knowledge and Reasoning in Image Understanding

Aditya

Yang

Baral

2019

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence

Deep learning based data-driven approaches have been successfully applied in various image understanding applications ranging from object recognition, semantic segmentation to visual question answering. However, the lack of knowledge integration as well as higher-level reasoning capabilities with the methods still pose a hindrance. In this work, we present a brief survey of a few representative reasoning mechanisms, knowledge integration methods and their corresponding image understanding applications developed by various groups of researchers, approaching the problem from a variety of angles. Furthermore, we discuss upon key efforts on integrating external knowledge with neural networks. Taking cues from these efforts, we conclude by discussing potential pathways to improve reasoning capabilities.
