Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1044
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Abstract: Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual …
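The outer product the abstract refers to is approximated compactly in this line of work by projecting each modality with a Count Sketch and convolving the sketches via FFT. Below is a minimal NumPy sketch of that idea, under assumed dimensions (e.g. a 2048-d visual vector and a 300-d textual vector, 1024-d output); it illustrates the technique only and is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch_params(input_dim, output_dim, rng):
    # Random hash bucket and random sign per input coordinate,
    # drawn once and then held fixed.
    h = rng.integers(0, output_dim, size=input_dim)
    s = rng.choice([-1.0, 1.0], size=input_dim)
    return h, s

def count_sketch(x, h, s, output_dim):
    # Scatter each signed coordinate into its hash bucket.
    y = np.zeros(output_dim)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(v, q, output_dim=1024, rng=rng):
    # Sketch each modality, then take the circular convolution of the
    # sketches via FFT; this approximates sketching the outer product
    # of v and q without ever materializing it.
    hv, sv = count_sketch_params(v.size, output_dim, rng)
    hq, sq = count_sketch_params(q.size, output_dim, rng)
    v_sk = count_sketch(v, hv, sv, output_dim)
    q_sk = count_sketch(q, hq, sq, output_dim)
    return np.real(np.fft.ifft(np.fft.fft(v_sk) * np.fft.fft(q_sk)))

# Example with assumed feature sizes:
pooled = mcb_pool(rng.normal(size=2048), rng.normal(size=300))
```

The payoff is memory: the full outer product of a 2048-d and a 300-d vector has 614,400 entries, while the pooled representation here stays at the chosen output dimension.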

Cited by 1,173 publications (1,052 citation statements)
References 36 publications
“…As shown in Table I, MCAN (k=2) obtains the best performance overall. It improves on MCB [5] from 68.6% to 70.1% for the Multiple-Choice task on the Test-dev set. The third group of three rows of Table I shows the methods with a co-attention mechanism, which combine several joint attention mechanisms.…”
Section: B. Results and Analysis
confidence: 98%
“…In MCAN, we replace the addition or multiplication operation with a concatenation operation to compute the joint attention map, and gain further improvements. MCB [5] achieves state-of-the-art performance on the VQA dataset by employing a feed-forward CNN to compute attention weights from high-dimensional joint features. MCAN reduces the dimension of the joint features and achieves better performance.…”
Section: B. Results and Analysis
confidence: 99%