2022
DOI: 10.48550/arxiv.2202.01993
Preprint
Grounding Answers for Visual Questions Asked by Visually Impaired People

Abstract: Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate SOTA VQA and VQA-Grounding models and demonstrate that current SOTA algorithms often fail to identify the correct visual evidence where the …

Cited by 3 publications (13 citation statements); references 27 publications.
“…This method relies on object detection networks to provide the anchor points, which adds an extra task for the network to learn. Also, using large generic corpora may not improve the accuracy for special datasets such as VizWiz-VQA-Grounding [3]. In contrast, our proposed method relies only on the feature-maps and does not define another task.…”
Section: Comparison To Existing Methods
confidence: 95%
“…Recently, many applications have been made based on deep neural networks that are expected to be used by the end-users of different platforms. For example, answer grounding methods are helpful in assistive technologies for people with vision impairments [3]. In order to use such technologies offline, it is crucial to implement the method for each platform.…”
Section: Methods
confidence: 99%
“…Visual Question Answering (VQA) is a VL task that has obtained a fundamental role in the evolution of various interactive VL AI systems, such as Visual Dialogue [10], Text-Image Retrieval [11] and Visual Commonsense Reasoning [12]. To this end, there is an extensive range of real-world applications that benefit significantly from the new advances around the VQA task, such as aiding systems for visually impaired individuals [13,14] and self-driving cars [15].…”
Section: Introduction
confidence: 99%