Word-to-region attention network for visual question answering
2018
DOI: 10.1007/s11042-018-6389-3

Cited by 25 publications (2 citation statements)
References 35 publications
“…Yang et al. [18] proposed classifying questions by type and applying a co-attention mechanism. In 2019, Peng et al. [19] combined two general attention units, SA (self-attention) and GA (guided attention), into a modular co-attention structure. In 2020, Guo et al. [20] proposed a visual question answering method based on a re-attention mechanism, which uses the answer to compute attention weights over the image and defines an attention-consistency loss that measures the distance between the visual attention learned from the question and that learned from the answer, adjusting the image attention distribution accordingly.…”
Section: Attention Mechanism
Confidence: 99%
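The SA/GA attention units described in the excerpt above can be sketched as follows. This is a minimal illustration assuming a standard scaled dot-product attention design; the module names, dimensions, and usage below are illustrative assumptions, not code from the cited papers.

```python
# Minimal sketch of SA (self-attention) and GA (guided-attention) units,
# assuming a standard scaled dot-product attention design (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DotProductAttention(nn.Module):
    """Scaled dot-product attention: queries attend over a context."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, query, context):
        q, k, v = self.q(query), self.k(context), self.v(context)
        weights = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return weights @ v


class SA(nn.Module):
    """Self-attention unit: a feature set attends to itself."""

    def __init__(self, dim):
        super().__init__()
        self.attn = DotProductAttention(dim)

    def forward(self, x):
        return self.attn(x, x)


class GA(nn.Module):
    """Guided-attention unit: one modality's features attend to another's
    (e.g. image regions guided by question words)."""

    def __init__(self, dim):
        super().__init__()
        self.attn = DotProductAttention(dim)

    def forward(self, x, guide):
        return self.attn(x, guide)


# Illustrative usage: 14 question-word features and 36 region features.
words = torch.randn(1, 14, 512)
regions = torch.randn(1, 36, 512)
words = SA(512)(words)           # refine words with word-to-word self-attention
fused = GA(512)(regions, words)  # regions attend to the refined question words
```

Stacking such SA and GA units is what the cited modular co-attention structure refers to; a re-attention variant would additionally compute answer-guided attention and penalize its distance from the question-guided attention.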
“…with more details and semantics, which is helpful for other visual understanding tasks such as visual captioning (Bin et al. 2017; Gao et al. 2017) and visual question answering (Peng et al. 2018; Gao et al. 2018). Sadeghi and Farhadi (2011) first defined a subject-predicate-object triplet as a visual phrase and trained classifiers for every triplet phrase.…”
Section: Relationship Prediction
Confidence: 99%