Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

Li, Guohao; Su, Hang; Zhu, Wenwu

doi:10.48550/arxiv.1712.00733

Cited by 20 publications

(22 citation statements)

References 30 publications

“…As consistent with the results on FVQA, we achieve a significant improvement (8.13% on top-1 accuracy and 16.51% on top-3 accuracy ) over state-of-the-art models. Note that our proposed GRUC network is an single-model, which outperforms the existing ensembled model [21]. We believe that the performance can be further improved if the technique of ensemble is involved in our model.…”

Section: Experimental Results On Visual7w-kb (mentioning)

confidence: 84%

“…However, the visual information is wholly provided which may introduce redundant information for reasoning the answer. The same problem also exists in [21], although they leveraged dynamic memory network instead of graph convolutional network to incorporate the external knowledge. Recent work [22] proposed a new knowledge-based task OK-VQA and introduced a retrieval-based model to extract the correct answer from Wikipedia.…”

Section: Incorporating External Knowledge In Vqa (mentioning)

confidence: 99%

“…However, most of questions of Visual7W solely base on the image content which don't require external knowledge. Furthermore, [21] OK-VQA: [22] proposed the Outside Knowledge VQA (OK-VQA) dataset, which is the largest KVQA dataset at present. Different from existing KVQA datasets, the questions in OK-VQA are manually generated by MTurk workers, which are not derived from specific knowledge bases.…”

Section: Datasets and Evaluation Metrics (mentioning)

confidence: 99%

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Yu, Zhu, Wang et al. 2020

Preprint

Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable to achieve general VQA. One limitation of existing KVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise for reasoning the correct answer. How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem. Inspired by the human cognition theory, in this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views. Thereinto, the visual graph and semantic graph are regarded as image-conditioned instantiation of the factual graph. On top of these new representations, we re-formulate Knowledge-based Visual Question Answering as a recurrent reasoning process for obtaining complementary evidence from multimodal information. To this end, we decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module that conducts parallel reasoning over both visual and semantic information. By stacking the modules multiple times, our model performs transitive reasoning and obtains question-oriented concept representations under the constraint of different modalities. Finally, we perform graph neural networks to infer the global-optimal

“…External knowledge has gained great interest in natural language processing [3,17] and computer vision [1,11,34]. As the information extracted from training sets are always insufficient to fully recover the real knowledge domain, previous works explicitly incorporate external knowledge to compensate it.…”

Section: External Knowledge Distillation (mentioning)

confidence: 99%

Saliency Prediction with External Knowledge

Zhang, Jiang, Zhao 2020

Preprint

The last decades have seen great progress in saliency prediction, with the success of deep neural networks that are able to encode high-level semantics. Yet, while humans have the innate capability in leveraging their knowledge to decide where to look (e.g. people pay more attention to familiar faces such as celebrities), saliency prediction models have only been trained with large eye-tracking datasets. This work proposes to bridge this gap by explicitly incorporating external knowledge for saliency models as humans do. We develop networks that learn to highlight regions by incorporating prior knowledge of semantic relationships, be it general or domain-specific, depending on the task of interest. At the core of the method is a new Graph Semantic Saliency Network (GraSSNet) that constructs a graph that encodes semantic relationships learned from external knowledge. A Spatial Graph Attention Network is then developed to update saliency features based on the learned graph. Experiments show that the proposed model learns to predict saliency from the external knowledge and outperforms the state-of-the-art on four saliency benchmarks.

“…Other methods focused on integrating external prior knowledge, mostly by producing a query to a knowledge database using the question and the image [38]. Extracted external knowledge was also fused with question and image representations [41,26].…”

Section: Related Work (mentioning)

confidence: 99%

VQA with no questions-answers training

Vatashsky, Ullman 2018

Preprint

Methods for teaching machines to answer visual questions have made significant progress in the last few years, but although demonstrating impressive results on particular datasets, these methods lack some important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answer and handling new domains without new examples. In this paper we present a system that achieves state-of-the-art results on the CLEVR dataset without any questions-answers training, utilizes real visual estimators and explains the answer. The system includes a question representation stage followed by an answering procedure, which invokes an extendable set of visual estimators. It can explain the answer, including its failures, and provide alternatives to negative answers. The scheme builds upon a framework proposed recently, with extensions allowing the system to deal with novel domains without relying on training examples.
