2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.232

Dual Attention Networks for Multimodal Reasoning and Matching

Abstract: We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language. DANs attend to specific regions in images and words in text through multiple steps and gather essential information from both modalities. Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively. The reasoning model allows visual and textual attentions to steer each other during collaborative inference…
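
As a rough illustration of the mechanism the abstract describes, the numpy sketch below runs memory-guided attention over image-region features and word features for a few steps. It is a minimal sketch only, not the authors' implementation: the additive scoring function, the dimensions, the random parameters, and the element-wise fusion used to update the joint memory are assumptions made for illustration.

```python
# Minimal sketch of dual attention with a shared memory (illustrative assumptions only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, memory, W_f, W_m, w):
    """Score each feature vector (row of `features`) against the current
    memory vector, softmax the scores, and return the weighted sum."""
    scores = np.tanh(features @ W_f + memory @ W_m) @ w
    alpha = softmax(scores)
    return alpha @ features, alpha

rng = np.random.default_rng(0)
d = 64                                  # shared embedding size (assumption)
regions = rng.normal(size=(49, d))      # e.g. a 7x7 grid of CNN region features
words = rng.normal(size=(12, d))        # word features from an RNN or embedding

# Randomly initialised attention parameters; in practice these are learned.
Wv, Wvm, wv = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
Wt, Wtm, wt = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

# Joint memory that both attention mechanisms read from and write to.
memory = regions.mean(axis=0) + words.mean(axis=0)

for step in range(2):                   # "multiple steps" of attention
    v_ctx, v_alpha = soft_attention(regions, memory, Wv, Wvm, wv)
    t_ctx, t_alpha = soft_attention(words, memory, Wt, Wtm, wt)
    # Updating the shared memory with the fused contexts lets each modality's
    # attention steer the other at the next step.
    memory = memory + v_ctx * t_ctx

print(v_alpha.round(3))                 # final attention over image regions
print(t_alpha.round(3))                 # final attention over words
```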

Cited by 606 publications (408 citation statements) · References: 26 publications
“…In particular, the neural attention mechanism is introduced to weigh the contributions of features from individual atoms and residues, which has been shown to be more effective than simply averaging all the atom and residue features (the results of the corresponding ablation studies are shown in Figs. S5–S6). The dual attention network (DAN) [28] is a recently published method that can produce attentions for two given related entities (each with a list of features). For example, given an image with a sentence annotation, DAN generates a textual attention for the word features of the sentence and a visual attention for the spatial features of the image.…”
Section: Problem Formulation (mentioning)
confidence: 99%
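
The statement above describes DAN producing one attention per entity, given a list of feature vectors for each. As a hedged sketch of that idea (an assumed additive scoring form, not the paper's exact equations), the single scoring function below yields a textual attention over word features and a visual attention over region features, each guided by a summary of the other modality; in practice each modality would use its own learned parameters.

```python
# Assumed additive formulation for illustration; the exact scoring in DAN may differ.
import numpy as np

def attention_weights(features, guide, W_f, W_g, w):
    """Return one normalised weight per feature vector, conditioned on `guide`."""
    scores = np.tanh(features @ W_f + guide @ W_g) @ w
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 32
word_feats = rng.normal(size=(10, d))    # one vector per word of the sentence
region_feats = rng.normal(size=(49, d))  # one vector per spatial location

W_f, W_g, w = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
textual_att = attention_weights(word_feats, region_feats.mean(axis=0), W_f, W_g, w)
visual_att = attention_weights(region_feats, word_feats.mean(axis=0), W_f, W_g, w)
print(textual_att.shape, visual_att.shape)   # (10,) and (49,)
```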
“…Lu et al [19] presented a hierarchical co-attention model that jointly reasons about image and question attention. Nam et al [20] proposed the Dual Attention Network, which attends to specific regions in images and words in text through multiple steps and gathers essential information from both modalities. Compared with these methods, our co-attention framework combines SWA and Question-Guided Image Attention (QIA) for multimodal representation.…”
Section: A. Feature Extraction and Representation (mentioning)
confidence: 99%
“…By reducing the effect of unimportant textual information, co-attention methods can effectively obtain richer multimodal representations. In common co-attention frameworks [19], [20], the textual attention obtains question attention based on visual features, in the sense that the image representation is used to guide the question attention and the question representation is used to guide the image attention.…”
Section: Introduction (mentioning)
confidence: 99%
“…Unlike [18], Nam et al [19] calculated the textual and visual attention maps by a refined multiplication operation. Wang et al [20] extracted "facts" from images and proposed a novel co-attention approach to address the VQA task.…”
Section: B. Attention Mechanisms for VQA (mentioning)
confidence: 99%
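
The "multiplication operation" in the last statement points to scoring that is built from an element-wise product of the two modalities' projected features, in contrast to the additive tanh scoring sketched earlier. The sketch below shows one multiplicative scoring of this kind; it is an assumed form for illustration, not necessarily the exact operation used in [19].

```python
# Hedged sketch of multiplicative (element-wise product) attention scoring.
import numpy as np

def multiplicative_attention(features, guide, W_f, W_g, w):
    """Score each feature against `guide` via a Hadamard product of projections."""
    joint = np.tanh(features @ W_f) * np.tanh(guide @ W_g)   # (N, k) element-wise
    scores = joint @ w                                        # one score per feature
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, k = 32, 16
regions = rng.normal(size=(196, d))      # e.g. a 14x14 grid of CNN features
question = rng.normal(size=d)            # question (or memory) vector
alpha = multiplicative_attention(regions, question,
                                 rng.normal(size=(d, k)),
                                 rng.normal(size=(d, k)),
                                 rng.normal(size=k))
print(alpha.shape, float(alpha.sum()))   # (196,) 1.0
```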