Cross-modal knowledge reasoning for knowledge-based visual question answering

Yu, Jing; Zhu, Zihao; Wang, Yujing; Zhang, Weifeng; Hu, Yue; Tan, Jianlong

doi:10.1016/j.patcog.2020.107563

Cited by 87 publications

(39 citation statements)

References 5 publications

“…Visual symbolic information in MMKG with graph-structured information conveying relations between visual concepts provides strong evidence to reason about the questions over graph network. Besides, the explicit semantic knowledge preserved in MMKG help refine the answers with more interpretability and generality [154]. The representations of different modalities preserved and unified in MMKG greatly benefit for relational reasoning across modalities.…”

Section: Visual Question Answering

Citation type: mentioning

confidence: 99%

Multi-Modal Knowledge Graph Construction and Application: A Survey

Zhu, Li, Wang et al. 2022

Preprint

Recent years have witnessed the resurgence of knowledge engineering which is featured by the fast growth of knowledge graphs. However, most of existing knowledge graphs are represented with pure symbols, which hurts the machine's capability to understand the real world. The multi-modalization of knowledge graphs is an inevitable key step towards the realization of human-level machine intelligence. The results of this endeavor are Multi-modal Knowledge Graphs (MMKGs). In this survey on MMKGs constructed by texts and images, we first give definitions of MMKGs, followed with the preliminaries on multi-modal tasks and techniques. We then systematically review the challenges, progresses and opportunities on the construction and application of MMKGs respectively, with detailed analyses of the strength and weakness of different solutions. We finalize this survey with open research problems relevant to MMKGs.

“…Semantic relations features and additional commonsense knowledge answer the complex questions for natural language reasoning. J. Yu et al [111] proposed a framework in which visual contents of an image is extracted and processed in multiple perspectives of knowledge graph like semantic, visual, and factual perspectives.…”

Section: Multimodal External Knowledge Bases Models (MMEKM)

Citation type: mentioning

confidence: 99%

A Review on Methods and Applications in Multimodal Deep Learning

Summaira, Li, Shoib et al. 2022

Preprint

Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, main issues are highlighted separately for each domain, along with their possible future research directions. CCS Concepts: • Computing methodologies → Machine learning; • Information systems → Multimedia and multimodal retrieval.

“…Once the particular type of question exceeds the scope of the question templates, the accuracy of the model decreases. Yu et al [29] formulated knowledge-based visual question answering as a recurrent reasoning process for obtaining complementary evidence from multimodal information. Marino et al [30] addressed the task of knowledgebased visual question answering and provided a benchmark where the image features relied on external knowledge resources.…”

Section: Knowledge Base

Citation type: mentioning

confidence: 99%

Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering

Tang 2022

Sensors

Collaborative reasoning for knowledge-based visual question answering is challenging but vital and efficient in understanding the features of the images and questions. While previous methods jointly fuse all kinds of features by attention mechanism or use handcrafted rules to generate a layout for performing compositional reasoning, which lacks the process of visual reasoning and introduces a large number of parameters for predicting the correct answer. For conducting visual reasoning on all kinds of image–question pairs, in this paper, we propose a novel reasoning model of a question-guided tree structure with a knowledge base (QGTSKB) for addressing these problems. In addition, our model consists of four neural module networks: the attention model that locates attended regions based on the image features and question embeddings by attention mechanism, the gated reasoning model that forgets and updates the fused features, the fusion reasoning model that mines high-level semantics of the attended visual features and knowledge base and knowledge-based fact model that makes up for the lack of visual and textual information with external knowledge. Therefore, our model performs visual analysis and reasoning based on tree structures, knowledge base and four neural module networks. Experimental results show that our model achieves superior performance over existing methods on the VQA v2.0 and CLVER dataset, and visual reasoning experiments prove the interpretability of the model.
