Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.575
F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

Abstract: Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and to understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-…
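The truncated sentence above points to the paper's proposal of a hierarchical model plus a regularization term that couples answer and explanation. The following PyTorch sketch is only a minimal illustration of that general idea under assumptions of my own: the module names, the soft explanation summary, and the coupling_regularizer formulation are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalExplainableQA(nn.Module):
    """Illustrative sketch only (not the paper's exact architecture):
    an explanation-selection head scores candidate sentences, and the
    answer head is conditioned on the resulting soft explanation summary."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.expl_scorer = nn.Linear(hidden_size, 1)   # one score per candidate sentence
        self.answer_head = nn.Linear(hidden_size, 2)   # start/end logits per token

    def forward(self, token_reprs, sentence_reprs):
        # token_reprs: (seq_len, hidden) from a shared encoder
        # sentence_reprs: (num_sents, hidden) pooled per candidate sentence
        expl_logits = self.expl_scorer(sentence_reprs).squeeze(-1)    # (num_sents,)
        expl_probs = torch.softmax(expl_logits, dim=-1)
        expl_summary = expl_probs @ sentence_reprs                    # (hidden,)
        answer_logits = self.answer_head(token_reprs + expl_summary)  # (seq_len, 2)
        return answer_logits, expl_logits

def coupling_regularizer(answer_start_probs, token_to_sentence, expl_probs):
    """Assumed coupling term (the paper's exact regularizer may differ):
    penalize answer probability mass on tokens outside highly scored
    explanation sentences."""
    per_token_expl_weight = expl_probs[token_to_sentence]   # (seq_len,)
    return -(answer_start_probs * torch.log(per_token_expl_weight + 1e-8)).sum()
```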

Cited by 11 publications (14 citation statements). References 36 publications.
“…Given a model with a target head o_t and a steered head o_s, our goal is to understand the behaviour of o_s on inputs where o_t provides the prediction. To this end, we focus on head combinations, where o_s is expressive enough to explain the outputs of o_t, but unlike most prior work aiming to explain by examining model outputs (Perez et al., 2019; Schuff et al., 2020), o_s is not explicitly trained for this purpose. Concretely, our analysis covers three settings, illustrated in Figure 2 and summarized in Table 1.…”
Section: Overview: Experiments and Findings (mentioning)
confidence: 99%
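As a rough illustration of the setup described in this excerpt (all names are assumptions, not the cited paper's code): two heads share an encoder, the target head supplies the prediction, and the steered head's output is simply read off and inspected on those same inputs.

```python
import torch

def collect_head_outputs(encoder, target_head, steered_head, inputs):
    """Illustrative sketch: record what a 'steered' head o_s produces on the
    inputs where a 'target' head o_t provides the model's prediction. The
    steered head is only observed here, not trained to explain o_t."""
    records = []
    with torch.no_grad():
        for x in inputs:
            h = encoder(x)                 # shared representation
            target_pred = target_head(h)   # the prediction comes from o_t
            steered_out = steered_head(h)  # o_s is read off without supervision
            records.append((target_pred, steered_out))
    return records
```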
“…To the best of our knowledge, this is the first work that analyzes the outputs of the non-target heads. Previous work used additional output heads to generate explanations for model predictions (Perez et al., 2019; Schuff et al., 2020). Specifically, recent work has explored utilization of summarization modules for explainable QA (Nishida et al., 2019; Deng et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
“…In this paper, we use RoBERTa-large [9] as P_LM, which can easily be replaced with other pre-trained language models (such as BERT [22], etc.) without significant influence on the results. The detailed process is as follows: […] Following [43], we use the overlap between the predicted answer â_i and the input answer a_i to evaluate the informativeness of the evidence e_i. Specifically, we use the F1 score, which is widely used to evaluate the accuracy of answer prediction in machine reading comprehension [42], as the informativeness score I(e_i) of the evidence e_i.…”
Section: B. Metrics of a Good Evidence (mentioning)
confidence: 99%
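The informativeness score described in this excerpt is the standard token-level (SQuAD-style) answer F1. A minimal sketch of that overlap computation, using the names â_i / a_i / e_i from the quote; the usual normalization steps beyond lowercasing (article and punctuation stripping) are omitted here.

```python
from collections import Counter

def token_f1(predicted_answer: str, gold_answer: str) -> float:
    """Token-level F1 between a predicted and a gold answer; per the quoted
    passage, this serves as the informativeness score I(e_i) of an evidence
    e_i whose evidence-conditioned prediction is â_i."""
    pred_tokens = predicted_answer.lower().split()
    gold_tokens = gold_answer.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: partial overlap between prediction and gold answer.
print(token_f1("the eiffel tower", "eiffel tower"))  # 0.8
```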
“…Furthermore, the difficulty is exacerbated by simultaneously guaranteeing readability of the evidences, which is necessary for ensuring user-friendliness of QA systems. Most previous relevant efforts, such as Schuff et al. [43], only focus on informativeness of evidences without considering the conciseness of the evidences. Although human experts are further employed to write informative-yet-concise evidences in QA systems [40], this inevitably incurs unaffordable human cost.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, model weaknesses can stay unnoticed using automatic scores alone. Moreover, Schuff et al. (2020) showed that automatic scores are not necessarily correlated with human-perceived model quality. Thus, human evaluation is a crucial step in the development of user-centered explainable AI systems.…”
(mentioning)
confidence: 99%