Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1044
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Abstract: Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual …
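The outer product the abstract refers to is approximated compactly in this line of work by projecting each modality with a Count Sketch and convolving the sketches via FFT. Below is a minimal NumPy sketch of that idea, under assumed dimensions (e.g. a 2048-d visual vector and a 300-d textual vector, 1024-d output); it illustrates the technique only and is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch_params(input_dim, output_dim, rng):
    # Random hash bucket and random sign per input coordinate,
    # drawn once and then held fixed.
    h = rng.integers(0, output_dim, size=input_dim)
    s = rng.choice([-1.0, 1.0], size=input_dim)
    return h, s

def count_sketch(x, h, s, output_dim):
    # Scatter each signed coordinate into its hash bucket.
    y = np.zeros(output_dim)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(v, q, output_dim=1024, rng=rng):
    # Sketch each modality, then take the circular convolution of the
    # sketches via FFT; this approximates sketching the outer product
    # of v and q without ever materializing it.
    hv, sv = count_sketch_params(v.size, output_dim, rng)
    hq, sq = count_sketch_params(q.size, output_dim, rng)
    v_sk = count_sketch(v, hv, sv, output_dim)
    q_sk = count_sketch(q, hq, sq, output_dim)
    return np.real(np.fft.ifft(np.fft.fft(v_sk) * np.fft.fft(q_sk)))

# Example with assumed feature sizes:
pooled = mcb_pool(rng.normal(size=2048), rng.normal(size=300))
```

The payoff is memory: the full outer product of a 2048-d and a 300-d vector has 614,400 entries, while the pooled representation here stays at the chosen output dimension.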

Cited by 1,173 publications (1,052 citation statements)
References 36 publications
“…As shown in Table I, MCAN (k=2) obtains the best performance overall. It improves on MCB [5] from 68.6% to 70.1% for the Multiple-Choice task on the Test-dev set. The third group of three rows of Table I shows the methods with a co-attention mechanism, which combine several joint attention mechanisms.…”
Section: B. Results and Analysis
confidence: 98%
“…In MCAN, we replace the addition or multiplication operation with a concatenation operation to compute the joint attention map, and gain further improvements. MCB [5] achieves state-of-the-art performance on the VQA dataset by employing a feed-forward CNN to compute attention weights from high-dimensional joint features. MCAN reduces the dimension of the joint features and achieves better performance.…”
Section: B. Results and Analysis
confidence: 99%