2022
DOI: 10.1109/tcyb.2020.3029423
ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering

Cited by 22 publications (6 citation statements)
References 56 publications
“…Further, H-CFIM gives lower accuracy compared to V-CFIM as it blends information from different attention paths but avoids the clash between the top-down and bottom-up paths. In Table 5, we have compared the performance of the proposed VQA method with 22 state-of-the-art methods, for example, Re-attention [23], ALSA [50], IASSM [51], MRA-Net [35] and CAM [52], on both test-dev and test-std sets.…”
Section: Based on the Combination of TCAM and CFIM
Citation type: mentioning (confidence: 99%)
“…In Table 5, we have compared the performance of the proposed VQA method with 22 state-of-the-art methods, for example, Re-attention [23], ALSA [50], IASSM [51], MRA-Net [35] and CAM [52], on both test-dev and test-std sets. Step-by-step reasoning is used by CAM to generate the compound objects.…”
Section: Performance Analysis
Citation type: mentioning (confidence: 99%)
“…[40] based on visual relationship detection, where image features and the question vector are used to generate the output. Liu et al. [41] presented a supervised attention-based VQA model and designed two attention modules, free-form and detection-based, to use past information for attention learning. Li et al. [42] proposed a relation-aware graph attention network to encode an image using a graph.…”
Section: Visual Question Answering
Citation type: mentioning (confidence: 99%)
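The statement above describes question-guided, detection-based attention only in prose. As a rough illustration, the following is a minimal, hypothetical PyTorch sketch of attention over detected object regions conditioned on a question vector. All names, dimensions, and the module structure are illustrative assumptions, not taken from ALSA or the citing paper.

```python
# Hypothetical sketch: question-guided attention over detected object regions.
# Dimensions and layer names are illustrative, not from the ALSA paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question vector
        self.score = nn.Linear(hidden_dim, 1)              # scalar attention logit per region

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim); question: (batch, question_dim)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        logits = self.score(joint).squeeze(-1)             # (batch, num_regions)
        attn = F.softmax(logits, dim=-1)                   # attention over regions
        attended = (attn.unsqueeze(-1) * regions).sum(dim=1)  # weighted region summary
        return attended, attn
```

The returned `attn` distribution is the quantity that supervised-attention methods additionally constrain with external attention labels, as the next statement discusses.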
“…[13,14,15] However, because of the small size of the structure dataset and the lack of detailed knowledge concerning protein–ligand interactions, most of the existing methods are not yet able to effectively learn the attention distribution and accurately capture the true interaction information between proteins and ligands, limiting the predictive performance [16]. Several studies in the fields of visual question answering [17,18] and natural language processing [19,20] have demonstrated that training the attention mechanism in a supervised manner can result in a more effective attention distribution and improve model performance significantly, but its effectiveness in building a better protein–ligand binding affinity prediction model remains unclear.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
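To make "training the attention mechanism in a supervised manner" concrete, here is a minimal sketch of an attention-supervision term, assuming a reference attention distribution (e.g., human-annotated region importance) is available and normalized over the same regions as the model's attention. The function name and the KL-based form are illustrative assumptions, not the specific loss used in ALSA or the citing works.

```python
# Minimal sketch: supervising a predicted attention distribution with a
# reference (e.g., human-annotated) distribution. Purely illustrative.
import torch

def supervised_attention_loss(pred_attn, ref_attn, eps=1e-8):
    # pred_attn, ref_attn: (batch, num_regions), each row sums to 1.
    # KL(ref || pred): penalizes placing little mass where the reference attends.
    kl = ref_attn * (torch.log(ref_attn + eps) - torch.log(pred_attn + eps))
    return kl.sum(dim=-1).mean()

# A total training objective would typically combine the task loss (e.g., answer
# classification) with this term, weighted by a hyperparameter:
#   loss = answer_loss + lambda_attn * supervised_attention_loss(pred_attn, ref_attn)
```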