2020
DOI: 10.1109/tkde.2020.2998805

Answer Again: Improving VQA with Cascaded-Answering Model

Cited by 18 publications (8 citation statements). References 50 publications.
“…Further, H-CFIM gives lower accuracy compared to V-CFIM as it blends information from different attention paths but avoids the clash between the top-down and bottom-up paths. In Table 5, we have compared the performance of the proposed VQA method with 22 state-of-the-art methods, for example, Re-attention [23], ALSA [50], IASSM [51], MRA-Net [35] and CAM [52], on both test-dev and test-std sets.…”
Section: Based on the Combination of TCAM and CFIM (mentioning)
confidence: 99%
“…Recently, multi-modal analysis has attracted a lot of attention with the rapid growth of multi-media data. Different kinds of information they contain are complementary and help achieve comprehensive results [12,30,31,43,53]. So it is significant to learn multi-modal representation for boosting the single-modal tasks.…”
Section: Multi-modal Analysis (mentioning)
confidence: 99%
“…In order to obtain more expressive image and question features, most existing models [1] highlight important words in the question and the image regions associated with the question using attention mechanisms. However, these existing methods only consider object attention, which may be sufficient to answer some simple questions, like the one in figure ??.…”
mentioning
confidence: 99%
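
The statement above refers to attention mechanisms that weight question words and question-relevant image regions. As a rough illustration only (not the cited paper's model), below is a minimal sketch of single-glimpse, question-guided soft attention over detected region features in PyTorch; the class name, feature dimensions, and the single-vector question encoding are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Minimal sketch of object-level (top-down) attention for VQA.

    Assumptions (not from the cited paper): region features come from an
    off-the-shelf detector (e.g. 36 regions x 2048-d), the question is
    already encoded into a single vector, and a single glimpse is used.
    """

    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question vector
        self.score = nn.Linear(hidden_dim, 1)              # scalar score per region

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim)
        # question: (batch, question_dim)
        q = self.proj_q(question).unsqueeze(1)             # (batch, 1, hidden)
        joint = torch.tanh(self.proj_v(regions) + q)       # fuse question with each region
        weights = F.softmax(self.score(joint), dim=1)      # attention weights over regions
        attended = (weights * regions).sum(dim=1)          # question-weighted region summary
        return attended, weights.squeeze(-1)

# Usage: attend to 36 region features with a 1024-d question encoding.
attn = QuestionGuidedAttention()
v = torch.randn(2, 36, 2048)
q = torch.randn(2, 1024)
v_att, w = attn(v, q)
print(v_att.shape, w.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```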