2020
DOI: 10.48550/arxiv.2010.06283
Preprint

F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

Abstract: Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-…

Cited by 1 publication (2 citation statements)
References 22 publications
“…On such a topic, while a variety of evaluation methods and approaches have been proposed [63], it is still argued that the best way to assess the interpretability of black-box models is through user experiments and user-centred evaluations, as there is no guarantee of the correctness of automated metrics in evaluating explainability [64], and high explainability metric scores do not necessarily reflect high human interpretability in real-world scenarios [64,65]. The same is true for well-known metrics (e.g., F1-score) [66]. Supporting such claims, Fel et al. [65] conducted experiments to evaluate the capability of human participants to leverage representative attribution methods to learn to predict the decisions of various image classifiers.…”
Section: Evaluation Of Explainability Methods By Means Of Human Knowl… (mentioning)
Confidence: 99%
“…The same approach is applicable to the evaluation of the interpretability of black-box models, i.e., directly understanding the intrinsic explainability of a model [67]. Such evaluations are usually achieved through user questionnaires [66, 68-70], whose questions vary depending on the nature of the experiment, the model, etc. On the other hand, comparing the interpretability of different explainability methods to choose the best-suited one requires the design and implementation of ad hoc human-in-the-loop approaches.…”
Section: Evaluation Of Explainability Methods By Means Of Human Knowl… (mentioning)
Confidence: 99%