2020
DOI: 10.1016/j.patrec.2020.02.031
Visual question answering with attention transfer and a cross-modal gating mechanism

Cited by 14 publications (3 citation statements)
References 5 publications
“…The models are pretrained on large-scale multi-modal datasets with self-supervised objectives. Further finetuning them on specific tasks leads to new state-of-the-art results on several multi-modal challenges such as visual question answering [2,8,22], image-text retrieval [20,24], and visual commonsense reasoning [40]. Murahari et al. [28] adapt the two-stream ViLBERT [25] to VisDial via two-step finetuning and boost the evaluation metrics by a large margin.…”
Section: Visual Dialog
Mentioning, confidence: 99%
“…Despite the impressive performance of AI algorithms in various fields, their safety and reliability are still a concern. Recent studies have achieved strong performance in areas such as image [5] and text classification [18], object detection [10], segmentation [9], image captioning [20], visual question answering [8], and scene graph generation [19], with some tasks reaching near-perfect results. However, AI has not been fully deployed in sensitive fields like autonomous driving, medical diagnosis, or assistance for socially vulnerable groups.…”
Section: Introduction
Mentioning, confidence: 99%
“…Chen et al. [3] have improved the robustness of VQA approaches by synthesizing counterfactual samples for training. Li et al. [15] have employed an attention-based mechanism with transfer learning, along with a cross-modal gating approach, to improve VQA performance. Huang et al. [8] have utilized a graph-based convolutional network to better encode relational information for VQA.…”
Section: Introduction
Mentioning, confidence: 99%
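The cross-modal gating approach mentioned in the statement above can be illustrated with a minimal sketch. This is not the implementation from Li et al. [15]; it assumes generic pooled visual features `v` and question features `q` of equal dimension, and shows one common way such a gate is formed: a sigmoid computed from both modalities that weights, per dimension, how much each modality contributes to the fused representation.

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Illustrative cross-modal gating fusion (a generic sketch, not the
    method of Li et al. [15]). A sigmoid gate conditioned on both modalities
    blends visual and textual features per dimension."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate is conditioned on the concatenation of both modalities.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: visual features, q: question features, both (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        return g * v + (1.0 - g) * q  # convex, per-dimension blend

# Toy usage: fuse hypothetical 512-d image and question embeddings.
fuse = CrossModalGate(dim=512)
v = torch.randn(4, 512)  # e.g., pooled image-encoder features
q = torch.randn(4, 512)  # e.g., pooled question-encoder features
fused = fuse(v, q)       # shape: (4, 512)
```

The convex blend keeps the fused feature on the same scale as its inputs; in practice such a gate is typically followed by a task-specific classifier head.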