2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00268

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Abstract: Many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sensitive to the same input regions as humans. Our approach optimizes the alignment between human attention maps and gradient-based network importance. […]
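To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a HINT-style alignment objective: gradient-based importances are computed per image region and compared against human attention scores with a pairwise ranking loss. The names `model`, `region_feats`, and `human_importance` are placeholders, and the exact loss used in the paper may differ.

```python
# Illustrative sketch of a HINT-style alignment loss (not the authors' code).
# Assumptions: `model` maps a set of region features to per-answer scores,
# `human_importance` holds one human attention score per region, and the
# network's importance of a region is the gradient of the answer score w.r.t.
# that region's features.
import torch
import torch.nn.functional as F

def hint_ranking_loss(model, region_feats, human_importance, answer_idx):
    """Pairwise ranking loss between gradient-based and human region importances.

    region_feats:      (num_regions, feat_dim) bottom-up region features
    human_importance:  (num_regions,) human attention score per region
    answer_idx:        index of the ground-truth answer
    """
    region_feats = region_feats.clone().requires_grad_(True)
    score = model(region_feats)[answer_idx]          # scalar score for the GT answer

    # Gradient-based importance of each region: sum |d score / d feature| over dims.
    grads = torch.autograd.grad(score, region_feats, create_graph=True)[0]
    net_importance = grads.abs().sum(dim=1)          # (num_regions,)

    # For every pair (i, j) that humans rank i above j, ask the network to agree.
    hi = human_importance.unsqueeze(1) - human_importance.unsqueeze(0)   # (R, R)
    ni = net_importance.unsqueeze(1) - net_importance.unsqueeze(0)       # (R, R)
    human_prefers_i = (hi > 0).float()
    # Hinge penalty whenever the network's ordering disagrees with the human one.
    loss = (human_prefers_i * F.relu(-ni)).sum() / human_prefers_i.sum().clamp(min=1)
    return loss
```

This loss is added to the usual task loss during fine-tuning, so the model keeps answering correctly while being pushed to rely on the regions humans found relevant.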

Cited by 168 publications (89 citation statements)
References 27 publications (60 reference statements)
“…[151], [154], [158] Local approximation: LIME, SHAP, HINT. LIME models the change in prediction caused by a change in the input for a local data point; SHAP gives the average contribution of each input to a prediction. HINT looks at the same image regions as humans to make predictions.…”
Section: E (mentioning)
confidence: 99%
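The "local approximation" idea contrasted here can be shown with a small, hypothetical NumPy sketch of a LIME-style surrogate: perturb a single input, query the black-box model, and fit a proximity-weighted linear model whose coefficients act as local feature importances. The names `predict_fn` and `local_linear_explanation` are illustrative and not taken from any cited work; the actual LIME and SHAP libraries differ in detail.

```python
# From-scratch sketch of a LIME-style local linear surrogate (illustrative only).
# Assumption: `predict_fn` maps a batch of inputs (N, d) to scalar predictions (N,).
import numpy as np

def local_linear_explanation(predict_fn, x, num_samples=1000, sigma=0.1, seed=0):
    """Fit a proximity-weighted linear surrogate around the point `x`.

    Returns one coefficient per input feature; a larger magnitude means the
    prediction is more sensitive to that feature near `x`.
    """
    rng = np.random.default_rng(seed)
    # Perturb the input locally and query the black-box model.
    perturbations = rng.normal(scale=sigma, size=(num_samples, x.shape[0]))
    samples = x + perturbations
    preds = predict_fn(samples)

    # Weight samples by proximity to x (closer perturbations matter more).
    weights = np.exp(-np.sum(perturbations ** 2, axis=1) / (2 * sigma ** 2))

    # Weighted least squares: preds ~ samples @ coef + intercept.
    X = np.hstack([samples, np.ones((num_samples, 1))])
    W = np.diag(weights)
    coef, *_ = np.linalg.lstsq(X.T @ W @ X, X.T @ W @ preds, rcond=None)
    return coef[:-1]  # drop the intercept; per-feature local importance
```

HINT differs from such post-hoc surrogates in that it uses the human importance signal during training rather than only explaining a fixed model afterwards.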
“…Aiming to emphasize the significance of visual information, they weakened unwanted correlations between questions and answers, whereas we appropriately use the information in questions to guide vision-based concept verification. Selvaraju et al. (2019) proposed a human importance-aware network tuning method that uses human supervision to improve visual grounding. They forced the model to focus on the right regions by optimizing the alignment between human attention maps and gradient-based network importance.…”
Section: Related Work (mentioning)
confidence: 99%
“…Implementation Detail. We build our model on the bottom-up and top-down attention (UpDn) method (Anderson et al. 2018), as in (Ramakrishnan, Agrawal, and Lee 2018) and (Selvaraju et al. 2019). UpDn utilizes two kinds of attention mechanisms: bottom-up attention and top-down attention.…”
Section: Experiments: Datasets and Experimental Settings (mentioning)
confidence: 99%
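The two attention mechanisms named in this statement can be illustrated with a small, hypothetical PyTorch module: bottom-up attention corresponds to precomputed object-region features (e.g., from a detector), while top-down attention weights those regions conditioned on the question. This is only a sketch; the actual UpDn model of Anderson et al. (2018) uses gated nonlinearities and differs in its exact dimensions.

```python
# Hypothetical sketch of UpDn-style top-down attention over bottom-up region features.
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(region_dim + question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, question_emb):
        """
        region_feats:  (batch, num_regions, region_dim) bottom-up detector features
        question_emb:  (batch, question_dim) encoded question
        returns:       (batch, region_dim) question-conditioned image feature
        """
        num_regions = region_feats.size(1)
        q = question_emb.unsqueeze(1).expand(-1, num_regions, -1)
        joint = torch.cat([region_feats, q], dim=-1)
        # One scalar attention weight per region, normalized over regions.
        attn = torch.softmax(self.score(torch.tanh(self.proj(joint))), dim=1)
        return (attn * region_feats).sum(dim=1)
```

The attended feature is then fused with the question embedding and fed to an answer classifier; HINT-style tuning acts on top of such a backbone.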
“…With the aim of producing clarifying explanations of why a particular image captioning model fails or succeeds, since a deep neural network (DNN) is considered a black-box model that is hard to inspect, recent strategies make sure that the objects the captions talk about are indeed detected in the images [24,25]. Textual explanations can also contribute to making vision and language models more robust, in the sense of being more semantically grounded [26].…”
Section: Image Captioning Models (mentioning)
confidence: 99%