Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1348

Generating Question Relevant Captions to Aid Visual Question Answering

Abstract: Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrates th…
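
The abstract's central mechanism is the online, gradient-based selection of question-relevant captions from an existing caption dataset. The sketch below is a minimal illustration of that idea, not the authors' actual architecture: a candidate caption is scored by how well the gradient of a (toy) captioning loss on a shared image representation agrees with the gradient of the VQA answer loss on the same representation. The dimensions, linear heads, and candidate-caption setup are all assumptions made for illustration.

```python
# Hedged sketch: gradient-agreement scoring of candidate captions.
# Everything here (feature size D, the two linear heads, 5 candidate
# captions, answer index 3) is an illustrative assumption.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D = 16                                        # assumed size of the shared image representation
img = torch.randn(D, requires_grad=True)      # pooled image features (stand-in)
vqa_head = torch.nn.Linear(D, 10)             # toy answer classifier (10 candidate answers)
cap_head = torch.nn.Linear(D, 5)              # toy caption scorer over 5 candidate captions

answer = torch.tensor([3])                    # ground-truth answer index (made up)

# Gradient of the VQA loss with respect to the shared image features.
vqa_loss = F.cross_entropy(vqa_head(img).unsqueeze(0), answer)
(g_vqa,) = torch.autograd.grad(vqa_loss, img)

# Score each candidate caption by how well its training signal on the image
# features agrees with the VQA signal (cosine similarity of the two gradients).
scores = []
for cap_idx in range(5):
    img_copy = img.detach().clone().requires_grad_(True)
    cap_loss = F.cross_entropy(cap_head(img_copy).unsqueeze(0),
                               torch.tensor([cap_idx]))
    (g_cap,) = torch.autograd.grad(cap_loss, img_copy)
    scores.append(F.cosine_similarity(g_vqa, g_cap, dim=0).item())

best = max(range(len(scores)), key=scores.__getitem__)
print("relevance scores:", [round(s, 3) for s in scores])
print("caption selected as question-relevant:", best)
```

In the paper the relevance decision is made online during joint training of the captioning and VQA objectives; this sketch only isolates the gradient-agreement scoring step under the stated assumptions.
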

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1
1

Citation Types

0
25
0

Year Published

2020
2020
2024
2024

Publication Types

Select...
4
3
3

Relationship

0
10

Authors

Journals

Cited by 40 publications (25 citation statements). References 36 publications.

“…Another NLP model to produce explanations about an image is tackled by the problem of visual question answering [32], especially useful for the blind or for image captioning projects. Generating questions that can be answered by a DNN's output caption can improve the explainability and quality of image captioning models [33]. RQ 3.…”
Section: Visual Question Answering Models (mentioning)
confidence: 99%

“…This may be because the visual ImageNet feature struggles to learn models for answering textual questions. Furthermore, the concepts encode richer information, in that the words have fewer structural constraints and can easily include the attributes of, and relations among, multiple objects, as revealed in [47]. STAGE [27] is equipped with additional temporal supervision and a combination of global and local features, yet achieves lower results than ours.…”
Section: Models (mentioning)
confidence: 99%

“…To overcome these shortcomings, we propose to augment a deep learning architecture that utilizes neural attention with additional, external knowledge about the image. This type of approach has been used in the past (You et al. 2016; Wu et al. 2016; Kim and Bansal 2019; Wu, Hu, and Mooney 2019); however, our work seeks to take advantage of a different form of knowledge. Our resulting network, which we call the Visual Question Answering-Contextual Information network (VQA-CoIn), improves upon past work by extending it to incorporate semantic information extracted from every region of an image via image descriptions.…”
Section: Introduction (mentioning)
confidence: 99%