Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP 2021
DOI: 10.18653/v1/2021.blackboxnlp-1.1

To what extent do human explanations of model behavior align with actual model behavior?

Abstract: Given the increasingly prominent role NLP models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior. Using Natural Language Inference (NLI) as a case study, we investigate the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. More specifically, we define three alignment metrics that quantify how well natural language explanations align with model sensitivity to input…

Cited by 10 publications (7 citation statements)
References 41 publications
“…For a discussion on the merits of IG, cf. Prasad et al. (2021), and Bastings and Filippova (2020) on saliency vs. attention methods in general.…”
Section: Introduction
confidence: 99%
“…IG has two major advantages: (i) it is based on gradient calculations and can thus be applied to arbitrary neural models; and (ii) it satisfies several desirable properties, for example, the sum of the contributions of each input feature matches the output value (the Completeness axiom described in Sundararajan et al., 2017). It has also been actively applied to the analysis of MLM-based models (Hao et al., 2021; Prasad et al., 2021; Bastings et al., 2022; Kobayashi et al., 2023).…”
Section: Integrated Gradients (IG)
confidence: 99%
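For reference, the Completeness property mentioned in that statement follows directly from the definition of Integrated Gradients in Sundararajan et al. (2017). A minimal statement, writing F for the model output, x for the input, and x' for a baseline input:

\[
\mathrm{IG}_i(x) \;=\; (x_i - x'_i) \int_0^1 \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha,
\qquad
\sum_i \mathrm{IG}_i(x) \;=\; F(x) - F(x').
\]

That is, the per-feature attributions sum to the difference between the model output at the input and at the baseline, which is what makes IG attributions directly comparable to the model's prediction.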
“…While growing efforts are being made to evaluate interpretability approaches for NLP models (Atanasova et al., 2020; DeYoung et al., 2020; Prasad et al., 2021; Nguyen, 2018; Hase and Bansal, 2020; Nguyen and Martínez, 2020; Jacovi and Goldberg, 2020), the evaluation is not domain-specific. The benchmarks therefore fail to consider the specific sensitive problems and biases that are proper to the hate speech domain, on which the explanation validation must focus.…”
Section: Related Work
confidence: 99%