2016
DOI: 10.1609/aaai.v30i1.10442

Learning to Answer Questions from Image Using Convolutional Neural Network

Abstract: In this paper, we propose to employ the convolutional neural network (CNN) for the image question answering (QA) task. Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer …
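To make the three-part design in the abstract concrete, here is a minimal PyTorch sketch of how an image CNN, a sentence CNN, and a multimodal convolution layer could fit together. All layer sizes, the tiling-based fusion, and the names (MultimodalCNN, img_proj, mm_conv, etc.) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: image encoder + 1-D sentence CNN + multimodal convolution.
# Dimensions and fusion details are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalCNN(nn.Module):
    def __init__(self, vocab_size, num_answers, emb_dim=128, img_dim=4096, hid=256):
        super().__init__()
        # Image side: assume a pre-extracted CNN feature vector, projected
        # into the joint dimension.
        self.img_proj = nn.Linear(img_dim, hid)
        # Sentence CNN: 1-D convolutions over word embeddings compose the question.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.q_conv = nn.Conv1d(emb_dim, hid, kernel_size=3, padding=1)
        # Multimodal convolution: convolve over per-position (question, image)
        # pairs to model inter-modal interactions.
        self.mm_conv = nn.Conv1d(2 * hid, hid, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hid, num_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim); question_ids: (B, T)
        img = torch.tanh(self.img_proj(img_feat))                # (B, hid)
        q = self.embed(question_ids).transpose(1, 2)             # (B, emb_dim, T)
        q = torch.tanh(self.q_conv(q))                           # (B, hid, T)
        # Tile the image vector along the question length and concatenate.
        img_tiled = img.unsqueeze(2).expand(-1, -1, q.size(2))   # (B, hid, T)
        fused = torch.tanh(self.mm_conv(torch.cat([q, img_tiled], dim=1)))
        pooled = F.max_pool1d(fused, fused.size(2)).squeeze(2)   # (B, hid)
        return self.classifier(pooled)                           # answer logits
```

The answer is produced by classification over a fixed answer vocabulary, which matches how most citing works below describe the task.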

Cited by 136 publications (19 citation statements); references 22 publications.

Citation statements (ordered by relevance):
“…The VQA task, where a VQA agent is expected to correctly answer a question related to an image, was proposed by Antol et al (2015). Most of the early VQA models (Antol et al 2015; Andreas et al 2016; Ben-younes et al 2017; Fukui et al 2016; Lu et al 2016; Ma, Lu, and Li 2016) integrate a CNN-RNN based architecture that fuses the RNN encoding of the question and the CNN encoding of the image to predict the answer, possibly improved by attention mechanisms that highlight the visual objects related to the question (Yang et al 2016; Anderson et al 2018; Lu et al 2016). Recently, graph neural networks that represent the image as a scene graph, where nodes are objects and edges are relations between two connected objects, have attracted attention in many vision-language tasks including VQA.…”
Section: Related Work: Visual Question Answering (citation type: mentioning)
confidence: 99%
“…Inspired by the success of CNN on image classification, early methods adopt CNN models (e.g., VGGNet (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al, 2012), GoogLeNet (Szegedy et al, 2015), and ResNet (He et al, 2016)) pre-trained on ImageNet (Deng et al, 2009) to extract visual features. The very first VQA model (Antol et al, 2015) experiments with global visual features from the last fully connected layer of VGGNet, a practice inherited by the immediate follow-up works (Gao et al, 2015; Ren et al, 2015a; Ma et al, 2016). To retain spatial information in the original images, researchers (Yang et al, 2016; Zhu et al, 2016; Andreas et al, 2016b; Jabri et al, 2016) use grid features from earlier layers of pre-trained CNN models.…”
Section: Model Architecture (citation type: mentioning)
confidence: 99%
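The distinction this excerpt draws between global fully-connected features and spatial grid features can be made concrete with torchvision's pretrained VGG16. This is an illustrative sketch; the exact layers used vary across the cited papers.

```python
# Global vs. grid visual features from a pretrained VGG16 (illustrative).
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

images = torch.randn(1, 3, 224, 224)  # dummy batch; real input is a normalized image

with torch.no_grad():
    # Grid features: a spatial map from the convolutional trunk, keeping
    # one feature vector per image region.
    grid = vgg.features(images)              # (1, 512, 7, 7): 49 spatial cells
    # Global features: flatten and run the classifier head up to the last
    # hidden fully connected layer (an fc7-style vector).
    flat = vgg.avgpool(grid).flatten(1)      # (1, 512*7*7)
    global_feat = vgg.classifier[:-1](flat)  # (1, 4096)
```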
“…(Zhou et al 2015) proposed a simple baseline that learns image features with a CNN and question features with an LSTM, then concatenates the two to predict the answer. Instead of using an LSTM for learning question representations, (Noh, Hongsuck Seo, and Han 2016) used a GRU (Cho et al 2014), and (Ma, Lu, and Li 2016) trained a CNN for question embedding. Different from the above-mentioned methods, which address the VQA task as a classification problem, the work by (Malinowski, Rohrbach, and Fritz 2015) fed both image CNN features and question representations into an LSTM to generate the answer by sequence-to-sequence learning.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
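For reference, a minimal sketch of the concatenation baseline this excerpt describes, assuming pre-extracted CNN image features; the class name ConcatBaseline and all dimensions are hypothetical.

```python
# Concatenation baseline (sketch): CNN image features + LSTM question
# features, joined and classified over a fixed answer vocabulary.
import torch
import torch.nn as nn

class ConcatBaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=4096, emb_dim=300, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid, batch_first=True)
        self.classifier = nn.Linear(img_dim + hid, num_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) from a pretrained CNN; question_ids: (B, T)
        _, (h, _) = self.lstm(self.embed(question_ids))  # final hidden state
        fused = torch.cat([img_feat, h[-1]], dim=1)      # simple concatenation
        return self.classifier(fused)                    # answer logits
```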