2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.10

Stacked Attention Networks for Image Question Answering

Abstract: This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use semantic representation of a question as query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the p…

Cited by 1,631 publications (1,161 citation statements)
References 37 publications
“…The second two rows of Table 1 show the performance of methods with a one-glimpse attention mechanism. SAN [4] employs an element-wise addition operation to compute the attention map and achieves better performance than the methods in the first part. In MCAN, we replace the addition or multiplication operation with a concatenation operation to compute the joint-learning attention map, and gain further improvements.…”
Section: B. Results and Analysis (mentioning)
confidence: 99%
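A minimal sketch of the fusion difference this excerpt describes: SAN-style element-wise addition versus a concatenation variant for computing a one-glimpse attention map. Layer names, sizes, and the scoring head are illustrative assumptions, not taken from either paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """SAN-style fusion: h = tanh(Wv v + Wq q), softmax scores over regions."""
    def __init__(self, v_dim, q_dim, k_dim):
        super().__init__()
        self.wv = nn.Linear(v_dim, k_dim)   # projects region features
        self.wq = nn.Linear(q_dim, k_dim)   # projects the question vector
        self.wp = nn.Linear(k_dim, 1)       # scores each region

    def forward(self, v, q):
        # v: (batch, m, v_dim) region features; q: (batch, q_dim) question
        h = torch.tanh(self.wv(v) + self.wq(q).unsqueeze(1))
        p = F.softmax(self.wp(h).squeeze(-1), dim=1)    # attention map over m regions
        return (p.unsqueeze(-1) * v).sum(dim=1), p      # attended feature, map

class ConcatAttention(nn.Module):
    """Concatenation fusion: h = tanh(W [v; q]), the variant MCAN adopts."""
    def __init__(self, v_dim, q_dim, k_dim):
        super().__init__()
        self.w = nn.Linear(v_dim + q_dim, k_dim)
        self.wp = nn.Linear(k_dim, 1)

    def forward(self, v, q):
        # Tile the question vector across all m regions before concatenating.
        q_tiled = q.unsqueeze(1).expand(-1, v.size(1), -1)
        h = torch.tanh(self.w(torch.cat([v, q_tiled], dim=-1)))
        p = F.softmax(self.wp(h).squeeze(-1), dim=1)
        return (p.unsqueeze(-1) * v).sum(dim=1), p
```

Both heads produce a distribution over image regions; only the way the question and visual features are fused before scoring differs.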
“…A number of methods employ question-guided attention to solve the VQA task. Yang et al. [4] introduced soft attention and proposed a stacked attention model that uses question representations to query question-related image regions via multi-step reasoning. Noh et al. [15] adopted visual attention with joint loss minimization.…”
Section: B. Attention Mechanisms for VQA (mentioning)
confidence: 99%
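The multi-step reasoning this excerpt describes can be sketched as a loop in which the attended visual feature refines the query for the next attention hop. This follows the general shape of the stacked attention idea; the shared dimension, layer names, and the additive refinement rule are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHop(nn.Module):
    """One attention layer: score each region against the current query."""
    def __init__(self, dim):
        super().__init__()
        self.wv = nn.Linear(dim, dim)
        self.wq = nn.Linear(dim, dim)
        self.wp = nn.Linear(dim, 1)

    def forward(self, v, u):
        # v: (batch, m, dim) region features; u: (batch, dim) current query
        h = torch.tanh(self.wv(v) + self.wq(u).unsqueeze(1))
        p = F.softmax(self.wp(h).squeeze(-1), dim=1)
        return (p.unsqueeze(-1) * v).sum(dim=1)   # attended visual feature

class StackedAttention(nn.Module):
    """Query the image repeatedly, refining the query after each hop."""
    def __init__(self, dim, num_hops=2):
        super().__init__()
        self.hops = nn.ModuleList(AttentionHop(dim) for _ in range(num_hops))

    def forward(self, v, q):
        u = q                       # initial query is the question vector
        for hop in self.hops:
            u = u + hop(v, u)       # refine query with the attended feature
        return u                    # final representation fed to a classifier
```

Each hop narrows the attention map toward regions relevant to the answer, which is the "multi-step reasoning" the citing paper refers to.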
“…Our approach, in combination with the normalized Canonical Correlation Analysis (nCCA) embedding technique, improves on the state of the art for the Visual Madlibs task. Text-Embedding Loss: Motivated by the popularity of deep architectures for visual question answering that combine a global CNN image representation with an LSTM [7] question representation [4,13,17,20,29,30,31], as well as by the leading performance of nCCA on the multiple-choice Visual Madlibs task [32], we propose a novel extension of the CNN+LSTM architecture that chooses a prompt completion out of four candidates (see Figure 4) by measuring similarities directly in the embedding space. This contrasts with the prior approach of [32], which uses a post-hoc comparison between the discrete output of the CNN+LSTM method and all four candidates.…”
Section: arXiv:1608.02717v1 [cs.CV] 9 Aug 2016 (mentioning)
confidence: 99%
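A hedged sketch of the candidate-selection step this excerpt describes: rather than comparing generated text to the candidates post hoc, each completion is embedded and scored by similarity against the joint image-prompt embedding. The encoders are abstracted away; the embedding dimension and function name are placeholders, not the cited paper's API.

```python
import torch
import torch.nn.functional as F

def pick_completion(joint_emb: torch.Tensor, candidate_embs: torch.Tensor) -> int:
    """joint_emb: (d,) CNN+LSTM embedding of the image and prompt.
    candidate_embs: (4, d) embeddings of the four candidate completions.
    Returns the index of the candidate closest in cosine similarity."""
    sims = F.cosine_similarity(joint_emb.unsqueeze(0), candidate_embs, dim=1)
    return int(sims.argmax())

# Usage with random placeholder embeddings (d = 256 is arbitrary):
joint = torch.randn(256)
candidates = torch.randn(4, 256)
print(pick_completion(joint, candidates))
```

Scoring directly in the embedding space lets the model rank all four candidates with one forward pass per candidate, with no discrete decoding step in between.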