2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00268

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Abstract: Many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sensitive to the same input regions as humans. Our approach optimizes the alignment between human attention maps and gradient-based network importance. […]
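To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a HINT-style alignment objective: gradient-based importances are computed per image region and compared against human attention scores with a pairwise ranking loss. The names `model`, `region_feats`, and `human_importance` are placeholders, and the exact loss used in the paper may differ.

```python
# Illustrative sketch of a HINT-style alignment loss (not the authors' code).
# Assumptions: `model` maps a set of region features to per-answer scores,
# `human_importance` holds one human attention score per region, and the
# network's importance of a region is the gradient of the answer score w.r.t.
# that region's features.
import torch
import torch.nn.functional as F

def hint_ranking_loss(model, region_feats, human_importance, answer_idx):
    """Pairwise ranking loss between gradient-based and human region importances.

    region_feats:      (num_regions, feat_dim) bottom-up region features
    human_importance:  (num_regions,) human attention score per region
    answer_idx:        index of the ground-truth answer
    """
    region_feats = region_feats.clone().requires_grad_(True)
    score = model(region_feats)[answer_idx]          # scalar score for the GT answer

    # Gradient-based importance of each region: sum |d score / d feature| over dims.
    grads = torch.autograd.grad(score, region_feats, create_graph=True)[0]
    net_importance = grads.abs().sum(dim=1)          # (num_regions,)

    # For every pair (i, j) that humans rank i above j, ask the network to agree.
    hi = human_importance.unsqueeze(1) - human_importance.unsqueeze(0)   # (R, R)
    ni = net_importance.unsqueeze(1) - net_importance.unsqueeze(0)       # (R, R)
    human_prefers_i = (hi > 0).float()
    # Hinge penalty whenever the network's ordering disagrees with the human one.
    loss = (human_prefers_i * F.relu(-ni)).sum() / human_prefers_i.sum().clamp(min=1)
    return loss
```

This loss is added to the usual task loss during fine-tuning, so the model keeps answering correctly while being pushed to rely on the regions humans found relevant.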

Cited by 168 publications (89 citation statements)
References 27 publications (60 reference statements)
“…[151], [154], [158] Local approximation: LIME, SHAP, HINT. LIME models the change in prediction caused by a change in the input for a local data point; SHAP gives the average contribution of each input to a prediction. HINT looks at the same image regions as humans to make predictions.…”
Section: E (mentioning)
confidence: 99%
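The "local approximation" idea contrasted here can be shown with a small, hypothetical NumPy sketch of a LIME-style surrogate: perturb a single input, query the black-box model, and fit a proximity-weighted linear model whose coefficients act as local feature importances. The names `predict_fn` and `local_linear_explanation` are illustrative and not taken from any cited work; the actual LIME and SHAP libraries differ in detail.

```python
# From-scratch sketch of a LIME-style local linear surrogate (illustrative only).
# Assumption: `predict_fn` maps a batch of inputs (N, d) to scalar predictions (N,).
import numpy as np

def local_linear_explanation(predict_fn, x, num_samples=1000, sigma=0.1, seed=0):
    """Fit a proximity-weighted linear surrogate around the point `x`.

    Returns one coefficient per input feature; a larger magnitude means the
    prediction is more sensitive to that feature near `x`.
    """
    rng = np.random.default_rng(seed)
    # Perturb the input locally and query the black-box model.
    perturbations = rng.normal(scale=sigma, size=(num_samples, x.shape[0]))
    samples = x + perturbations
    preds = predict_fn(samples)

    # Weight samples by proximity to x (closer perturbations matter more).
    weights = np.exp(-np.sum(perturbations ** 2, axis=1) / (2 * sigma ** 2))

    # Weighted least squares: preds ~ samples @ coef + intercept.
    X = np.hstack([samples, np.ones((num_samples, 1))])
    W = np.diag(weights)
    coef, *_ = np.linalg.lstsq(X.T @ W @ X, X.T @ W @ preds, rcond=None)
    return coef[:-1]  # drop the intercept; per-feature local importance
```

HINT differs from such post-hoc surrogates in that it uses the human importance signal during training rather than only explaining a fixed model afterwards.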
“…Aiming to emphasize the significance of visual information, they weakened unwanted correlations between questions and answers, whereas we appropriately use the information in questions to guide vision-based concept verification. Selvaraju et al. (2019) proposed a human importance-aware network tuning method that uses human supervision to improve visual grounding. They forced the model to focus on the right regions by optimizing the alignment between human attention maps and gradient-based network importance.…”
Section: Related Work (mentioning)
confidence: 99%
“…Implementation Detail. We build our model on the bottom-up and top-down attention (UpDn) method (Anderson et al. 2018), as in (Ramakrishnan, Agrawal, and Lee 2018) and (Selvaraju et al. 2019). UpDn utilizes two kinds of attention mechanisms: bottom-up attention and top-down attention.…”
Section: Experiments: Datasets and Experimental Settings (mentioning)
confidence: 99%
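The two attention mechanisms named in this statement can be illustrated with a small, hypothetical PyTorch module: bottom-up attention corresponds to precomputed object-region features (e.g., from a detector), while top-down attention weights those regions conditioned on the question. This is only a sketch; the actual UpDn model of Anderson et al. (2018) uses gated nonlinearities and differs in its exact dimensions.

```python
# Hypothetical sketch of UpDn-style top-down attention over bottom-up region features.
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(region_dim + question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, question_emb):
        """
        region_feats:  (batch, num_regions, region_dim) bottom-up detector features
        question_emb:  (batch, question_dim) encoded question
        returns:       (batch, region_dim) question-conditioned image feature
        """
        num_regions = region_feats.size(1)
        q = question_emb.unsqueeze(1).expand(-1, num_regions, -1)
        joint = torch.cat([region_feats, q], dim=-1)
        # One scalar attention weight per region, normalized over regions.
        attn = torch.softmax(self.score(torch.tanh(self.proj(joint))), dim=1)
        return (attn * region_feats).sum(dim=1)
```

The attended feature is then fused with the question embedding and fed to an answer classifier; HINT-style tuning acts on top of such a backbone.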
“…With the aim of producing clarifying explanations of why a particular image captioning model fails or succeeds, since a deep neural network (DNN) is considered a black-box model that is hard to inspect, recent strategies make sure that the objects the captions talk about are indeed detected in the images [24,25]. Textual explanations can also contribute to making vision and language models more robust, in the sense of being more semantically grounded [26].…”
Section: Image Captioning Models (mentioning)
confidence: 99%