2016
DOI: 10.1609/aaai.v30i1.10442

Learning to Answer Questions from Image Using Convolutional Neural Network

Abstract: In this paper, we propose to employ the convolutional neural network (CNN) for the image question answering (QA) task. Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer …
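To make the three-part design in the abstract concrete, here is a minimal PyTorch sketch of how an image CNN, a sentence CNN, and a multimodal convolution layer could fit together. All layer sizes, the tiling-based fusion, and the names (MultimodalCNN, img_proj, mm_conv, etc.) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: image encoder + 1-D sentence CNN + multimodal convolution.
# Dimensions and fusion details are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalCNN(nn.Module):
    def __init__(self, vocab_size, num_answers, emb_dim=128, img_dim=4096, hid=256):
        super().__init__()
        # Image side: assume a pre-extracted CNN feature vector, projected
        # into the joint dimension.
        self.img_proj = nn.Linear(img_dim, hid)
        # Sentence CNN: 1-D convolutions over word embeddings compose the question.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.q_conv = nn.Conv1d(emb_dim, hid, kernel_size=3, padding=1)
        # Multimodal convolution: convolve over per-position (question, image)
        # pairs to model inter-modal interactions.
        self.mm_conv = nn.Conv1d(2 * hid, hid, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hid, num_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim); question_ids: (B, T)
        img = torch.tanh(self.img_proj(img_feat))                # (B, hid)
        q = self.embed(question_ids).transpose(1, 2)             # (B, emb_dim, T)
        q = torch.tanh(self.q_conv(q))                           # (B, hid, T)
        # Tile the image vector along the question length and concatenate.
        img_tiled = img.unsqueeze(2).expand(-1, -1, q.size(2))   # (B, hid, T)
        fused = torch.tanh(self.mm_conv(torch.cat([q, img_tiled], dim=1)))
        pooled = F.max_pool1d(fused, fused.size(2)).squeeze(2)   # (B, hid)
        return self.classifier(pooled)                           # answer logits
```

The answer is produced by classification over a fixed answer vocabulary, which matches how most citing works below describe the task.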

Cited by 136 publications (19 citation statements); references 22 publications.

Citation statements (ordered by relevance):
“…The VQA task, where a VQA agent is expected to correctly answer a question related to an image, was proposed by Antol et al (2015). Most of the early VQA models (Antol et al 2015; Andreas et al 2016; Ben-younes et al 2017; Fukui et al 2016; Lu et al 2016; Ma, Lu, and Li 2016) integrate a CNN-RNN based architecture that fuses the RNN encoding of the question and the CNN encoding of the image to predict the answer, possibly improved by attention mechanisms that highlight the visual objects related to the question (Yang et al 2016; Anderson et al 2018; Lu et al 2016). Recently, graph neural networks that represent the image as a scene graph, where nodes are objects and edges are relations between two connected objects, have attracted attention in many vision-language tasks including VQA.…”
Section: Related Work: Visual Question Answering (citation type: mentioning)
confidence: 99%
“…Inspired by the success of CNN on image classification, early methods adopt CNN models (e.g., VGGNet (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al, 2012), GoogLeNet (Szegedy et al, 2015), and ResNet (He et al, 2016)) pre-trained on ImageNet (Deng et al, 2009) to extract visual features. The very first VQA model (Antol et al, 2015) experiments with global visual features from the last fully connected layer of VGGNet, a practice inherited by the immediate follow-up works (Gao et al, 2015; Ren et al, 2015a; Ma et al, 2016). To retain spatial information in the original images, researchers (Yang et al, 2016; Zhu et al, 2016; Andreas et al, 2016b; Jabri et al, 2016) use grid features from earlier layers of pre-trained CNN models.…”
Section: Model Architecture (citation type: mentioning)
confidence: 99%
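The distinction this excerpt draws between global fully-connected features and spatial grid features can be made concrete with torchvision's pretrained VGG16. This is an illustrative sketch; the exact layers used vary across the cited papers.

```python
# Global vs. grid visual features from a pretrained VGG16 (illustrative).
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

images = torch.randn(1, 3, 224, 224)  # dummy batch; real input is a normalized image

with torch.no_grad():
    # Grid features: a spatial map from the convolutional trunk, keeping
    # one feature vector per image region.
    grid = vgg.features(images)              # (1, 512, 7, 7): 49 spatial cells
    # Global features: flatten and run the classifier head up to the last
    # hidden fully connected layer (an fc7-style vector).
    flat = vgg.avgpool(grid).flatten(1)      # (1, 512*7*7)
    global_feat = vgg.classifier[:-1](flat)  # (1, 4096)
```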
“…(Zhou et al 2015) proposed a simple baseline that learns image features with a CNN and question features with an LSTM, then concatenates the two to predict the answer. Instead of using an LSTM for learning question representations, (Noh, Hongsuck Seo, and Han 2016) used a GRU (Cho et al 2014), and (Ma, Lu, and Li 2016) trained a CNN for question embedding. Different from the above-mentioned methods, which address the VQA task as a classification problem, the work by (Malinowski, Rohrbach, and Fritz 2015) fed both image CNN features and question representations into an LSTM to generate the answer by sequence-to-sequence learning.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
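For reference, a minimal sketch of the concatenation baseline this excerpt describes, assuming pre-extracted CNN image features; the class name ConcatBaseline and all dimensions are hypothetical.

```python
# Concatenation baseline (sketch): CNN image features + LSTM question
# features, joined and classified over a fixed answer vocabulary.
import torch
import torch.nn as nn

class ConcatBaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=4096, emb_dim=300, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid, batch_first=True)
        self.classifier = nn.Linear(img_dim + hid, num_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) from a pretrained CNN; question_ids: (B, T)
        _, (h, _) = self.lstm(self.embed(question_ids))  # final hidden state
        fused = torch.cat([img_feat, h[-1]], dim=1)      # simple concatenation
        return self.classifier(fused)                    # answer logits
```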