2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.232

Dual Attention Networks for Multimodal Reasoning and Matching

Abstract: We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language. DANs attend to specific regions in images and words in text through multiple steps and gather essential information from both modalities. Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively. The reasoning model allows visual and textual attentions to steer each other during collaborative inference…
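
As a rough illustration of the mechanism the abstract describes, the numpy sketch below runs memory-guided attention over image-region features and word features for a few steps. It is a minimal sketch only, not the authors' implementation: the additive scoring function, the dimensions, the random parameters, and the element-wise fusion used to update the joint memory are assumptions made for illustration.

```python
# Minimal sketch of dual attention with a shared memory (illustrative assumptions only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, memory, W_f, W_m, w):
    """Score each feature vector (row of `features`) against the current
    memory vector, softmax the scores, and return the weighted sum."""
    scores = np.tanh(features @ W_f + memory @ W_m) @ w
    alpha = softmax(scores)
    return alpha @ features, alpha

rng = np.random.default_rng(0)
d = 64                                  # shared embedding size (assumption)
regions = rng.normal(size=(49, d))      # e.g. a 7x7 grid of CNN region features
words = rng.normal(size=(12, d))        # word features from an RNN or embedding

# Randomly initialised attention parameters; in practice these are learned.
Wv, Wvm, wv = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
Wt, Wtm, wt = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

# Joint memory that both attention mechanisms read from and write to.
memory = regions.mean(axis=0) + words.mean(axis=0)

for step in range(2):                   # "multiple steps" of attention
    v_ctx, v_alpha = soft_attention(regions, memory, Wv, Wvm, wv)
    t_ctx, t_alpha = soft_attention(words, memory, Wt, Wtm, wt)
    # Updating the shared memory with the fused contexts lets each modality's
    # attention steer the other at the next step.
    memory = memory + v_ctx * t_ctx

print(v_alpha.round(3))                 # final attention over image regions
print(t_alpha.round(3))                 # final attention over words
```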

Cited by 606 publications (408 citation statements) · References: 26 publications
“…In particular, the neural attention mechanism is introduced to weigh the contributions of features from individual atoms and residues, which has been shown to be more effective than simply averaging all the atom and residue features (the results of the corresponding ablation studies are shown in Figs. S5–S6). The dual attention network (DAN) [28] is a recently published method that can produce attentions for two given related entities (each with a list of features). For example, given an image with a sentence annotation, DAN generates a textual attention for the word features of the sentence and a visual attention for the spatial features of the image.…”
Section: Problem Formulation (mentioning)
confidence: 99%
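
The statement above describes DAN producing one attention per entity, given a list of feature vectors for each. As a hedged sketch of that idea (an assumed additive scoring form, not the paper's exact equations), the single scoring function below yields a textual attention over word features and a visual attention over region features, each guided by a summary of the other modality; in practice each modality would use its own learned parameters.

```python
# Assumed additive formulation for illustration; the exact scoring in DAN may differ.
import numpy as np

def attention_weights(features, guide, W_f, W_g, w):
    """Return one normalised weight per feature vector, conditioned on `guide`."""
    scores = np.tanh(features @ W_f + guide @ W_g) @ w
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 32
word_feats = rng.normal(size=(10, d))    # one vector per word of the sentence
region_feats = rng.normal(size=(49, d))  # one vector per spatial location

W_f, W_g, w = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
textual_att = attention_weights(word_feats, region_feats.mean(axis=0), W_f, W_g, w)
visual_att = attention_weights(region_feats, word_feats.mean(axis=0), W_f, W_g, w)
print(textual_att.shape, visual_att.shape)   # (10,) and (49,)
```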
“…Lu et al [19] presented a hierarchical co-attention model that jointly reasons about image and question attention. Nam et al [20] proposed the Dual Attention Network, which attends to specific regions in images and words in text through multiple steps and gathers essential information from both modalities. Compared with these methods, our co-attention framework combines SWA and Question-Guided Image Attention (QIA) for multimodal representation.…”
Section: A. Feature Extraction and Representation (mentioning)
confidence: 99%
“…By reducing the effect of unimportant textual information, co-attention methods can effectively obtain richer multimodal representations. In common co-attention frameworks [19], [20], the textual attention obtains question attention based on visual features, in the sense that the image representation is used to guide the question attention and the question representation is used to guide the image attention.…”
Section: Introduction (mentioning)
confidence: 99%
“…Unlike [18], Nam et al [19] calculated the textual and visual attention maps by a refined multiplication operation. Wang et al [20] extracted "facts" from images and proposed a novel co-attention approach to address the VQA task.…”
Section: B. Attention Mechanisms for VQA (mentioning)
confidence: 99%
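
The "multiplication operation" in the last statement points to scoring that is built from an element-wise product of the two modalities' projected features, in contrast to the additive tanh scoring sketched earlier. The sketch below shows one multiplicative scoring of this kind; it is an assumed form for illustration, not necessarily the exact operation used in [19].

```python
# Hedged sketch of multiplicative (element-wise product) attention scoring.
import numpy as np

def multiplicative_attention(features, guide, W_f, W_g, w):
    """Score each feature against `guide` via a Hadamard product of projections."""
    joint = np.tanh(features @ W_f) * np.tanh(guide @ W_g)   # (N, k) element-wise
    scores = joint @ w                                        # one score per feature
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, k = 32, 16
regions = rng.normal(size=(196, d))      # e.g. a 14x14 grid of CNN features
question = rng.normal(size=d)            # question (or memory) vector
alpha = multiplicative_attention(regions, question,
                                 rng.normal(size=(d, k)),
                                 rng.normal(size=(d, k)),
                                 rng.normal(size=k))
print(alpha.shape, float(alpha.sum()))   # (196,) 1.0
```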