2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.10

Stacked Attention Networks for Image Question Answering

Abstract: This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use semantic representation of a question as query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the p…

Cited by 1,631 publications (1,161 citation statements)
References 37 publications
“…The second two rows of Table 1 show the performance of methods with a one-glimpse attention mechanism. SAN [4] employs an element-wise addition operation to compute the attention map and achieves better performance than the methods in the first part. In MCAN, we replace the addition or multiplication operation with a concatenation operation to compute the joint-learning attention map, and gain further improvements.…”
Section: B. Results and Analysis (mentioning)
confidence: 99%
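A minimal sketch of the fusion difference this excerpt describes: SAN-style element-wise addition versus a concatenation variant for computing a one-glimpse attention map. Layer names, sizes, and the scoring head are illustrative assumptions, not taken from either paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """SAN-style fusion: h = tanh(Wv v + Wq q), softmax scores over regions."""
    def __init__(self, v_dim, q_dim, k_dim):
        super().__init__()
        self.wv = nn.Linear(v_dim, k_dim)   # projects region features
        self.wq = nn.Linear(q_dim, k_dim)   # projects the question vector
        self.wp = nn.Linear(k_dim, 1)       # scores each region

    def forward(self, v, q):
        # v: (batch, m, v_dim) region features; q: (batch, q_dim) question
        h = torch.tanh(self.wv(v) + self.wq(q).unsqueeze(1))
        p = F.softmax(self.wp(h).squeeze(-1), dim=1)    # attention map over m regions
        return (p.unsqueeze(-1) * v).sum(dim=1), p      # attended feature, map

class ConcatAttention(nn.Module):
    """Concatenation fusion: h = tanh(W [v; q]), the variant MCAN adopts."""
    def __init__(self, v_dim, q_dim, k_dim):
        super().__init__()
        self.w = nn.Linear(v_dim + q_dim, k_dim)
        self.wp = nn.Linear(k_dim, 1)

    def forward(self, v, q):
        # Tile the question vector across all m regions before concatenating.
        q_tiled = q.unsqueeze(1).expand(-1, v.size(1), -1)
        h = torch.tanh(self.w(torch.cat([v, q_tiled], dim=-1)))
        p = F.softmax(self.wp(h).squeeze(-1), dim=1)
        return (p.unsqueeze(-1) * v).sum(dim=1), p
```

Both heads produce a distribution over image regions; only the way the question and visual features are fused before scoring differs.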
“…A number of methods employ question-guided attention to solve the VQA task. Yang et al. [4] introduced soft attention and proposed a stacked attention model that uses question representations to query question-related image regions via multi-step reasoning. Noh et al. [15] adopted visual attention with joint loss minimization.…”
Section: B. Attention Mechanisms for VQA (mentioning)
confidence: 99%
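The multi-step reasoning this excerpt describes can be sketched as a loop in which the attended visual feature refines the query for the next attention hop. This follows the general shape of the stacked attention idea; the shared dimension, layer names, and the additive refinement rule are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHop(nn.Module):
    """One attention layer: score each region against the current query."""
    def __init__(self, dim):
        super().__init__()
        self.wv = nn.Linear(dim, dim)
        self.wq = nn.Linear(dim, dim)
        self.wp = nn.Linear(dim, 1)

    def forward(self, v, u):
        # v: (batch, m, dim) region features; u: (batch, dim) current query
        h = torch.tanh(self.wv(v) + self.wq(u).unsqueeze(1))
        p = F.softmax(self.wp(h).squeeze(-1), dim=1)
        return (p.unsqueeze(-1) * v).sum(dim=1)   # attended visual feature

class StackedAttention(nn.Module):
    """Query the image repeatedly, refining the query after each hop."""
    def __init__(self, dim, num_hops=2):
        super().__init__()
        self.hops = nn.ModuleList(AttentionHop(dim) for _ in range(num_hops))

    def forward(self, v, q):
        u = q                       # initial query is the question vector
        for hop in self.hops:
            u = u + hop(v, u)       # refine query with the attended feature
        return u                    # final representation fed to a classifier
```

Each hop narrows the attention map toward regions relevant to the answer, which is the "multi-step reasoning" the citing paper refers to.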
“…Our approach, in combination with the normalized Canonical Correlation Analysis (nCCA) embedding technique, improves on the state of the art for the Visual Madlibs task. Text-Embedding Loss: Motivated by the popularity of deep architectures for visual question answering that combine a global CNN image representation with an LSTM [7] question representation [4,13,17,20,29,30,31], as well as by the leading performance of nCCA on the multiple-choice Visual Madlibs task [32], we propose a novel extension of the CNN+LSTM architecture that chooses a prompt completion out of four candidates (see Figure 4) by measuring similarities directly in the embedding space. This contrasts with the prior approach of [32], which uses a post-hoc comparison between the discrete output of the CNN+LSTM method and all four candidates.…”
Section: arXiv:1608.02717v1 [cs.CV] 9 Aug 2016 (mentioning)
confidence: 99%
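A hedged sketch of the candidate-selection step this excerpt describes: rather than comparing generated text to the candidates post hoc, each completion is embedded and scored by similarity against the joint image-prompt embedding. The encoders are abstracted away; the embedding dimension and function name are placeholders, not the cited paper's API.

```python
import torch
import torch.nn.functional as F

def pick_completion(joint_emb: torch.Tensor, candidate_embs: torch.Tensor) -> int:
    """joint_emb: (d,) CNN+LSTM embedding of the image and prompt.
    candidate_embs: (4, d) embeddings of the four candidate completions.
    Returns the index of the candidate closest in cosine similarity."""
    sims = F.cosine_similarity(joint_emb.unsqueeze(0), candidate_embs, dim=1)
    return int(sims.argmax())

# Usage with random placeholder embeddings (d = 256 is arbitrary):
joint = torch.randn(256)
candidates = torch.randn(4, 256)
print(pick_completion(joint, candidates))
```

Scoring directly in the embedding space lets the model rank all four candidates with one forward pass per candidate, with no discrete decoding step in between.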