Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP 2021
DOI: 10.18653/v1/2021.blackboxnlp-1.1

To what extent do human explanations of model behavior align with actual model behavior?

Abstract: Given the increasingly prominent role NLP models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior. Using Natural Language Inference (NLI) as a case study, we investigate the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. More specifically, we define three alignment metrics that quantify how well natural language explanations align with model sensitivity to input…

Cited by 10 publications (7 citation statements)
References 41 publications
“…For a discussion on the merits of IG, cf. Prasad et al. (2021), and Bastings and Filippova (2020) on saliency vs. attention methods in general.…”
Section: Introduction
confidence: 99%
“…IG has two major advantages: (i) it is based on gradient calculations and can thus be applied to arbitrary neural models; and (ii) it satisfies several desirable properties, for example, the sum of the contributions of each input feature matches the output value (the Completeness axiom described in Sundararajan et al., 2017). It has also been actively applied to the analysis of MLM-based models (Hao et al., 2021; Prasad et al., 2021; Bastings et al., 2022; Kobayashi et al., 2023).…”
Section: Integrated Gradients (IG)
confidence: 99%
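For reference, the Completeness property mentioned in that statement follows directly from the definition of Integrated Gradients in Sundararajan et al. (2017). A minimal statement, writing F for the model output, x for the input, and x' for a baseline input:

\[
\mathrm{IG}_i(x) \;=\; (x_i - x'_i) \int_0^1 \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha,
\qquad
\sum_i \mathrm{IG}_i(x) \;=\; F(x) - F(x').
\]

That is, the per-feature attributions sum to the difference between the model output at the input and at the baseline, which is what makes IG attributions directly comparable to the model's prediction.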
“…While growing efforts are being made to evaluate interpretability approaches for NLP models (Atanasova et al., 2020; DeYoung et al., 2020; Prasad et al., 2021; Nguyen, 2018; Hase and Bansal, 2020; Nguyen and Martínez, 2020; Jacovi and Goldberg, 2020), the evaluation is not domain-specific. The benchmarks therefore fail to consider the specific sensitive problems and biases that are proper to the hate speech domain, on which the explanation validation must focus.…”
Section: Related Work
confidence: 99%