Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.4

Designing Precise and Robust Dialogue Response Evaluators

Abstract: Automatic dialogue response evaluators have been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only a moderate correlation with human judgement and are not robust. In this work, we propose to build a reference-free evaluator and to exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and gen…
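
The abstract describes a reference-free evaluator built on a pretrained (masked) language model. Below is a minimal sketch of what such an evaluator could look like; the model name (bert-base-uncased), the sigmoid regression head, and the example utterances are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical reference-free evaluator: a pretrained masked language model
# encodes the (context, response) pair and a small regression head maps the
# pooled representation to an appropriateness score. No golden response is
# needed, which is what makes the evaluator reference-free.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ReferenceFreeEvaluator(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.scorer = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]      # [CLS] pair representation
        return self.scorer(cls).squeeze(-1)    # appropriateness score in (0, 1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
evaluator = ReferenceFreeEvaluator()
batch = tokenizer("How was your weekend?",               # dialogue context
                  "Great, I went hiking with friends.",  # candidate response
                  return_tensors="pt")
with torch.no_grad():
    score = evaluator(**batch)
print(float(score))  # the head is untrained here, so the value is not yet meaningful
```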

Cited by 33 publications (39 citation statements). References 7 publications.
“…At the linguistic level, once speech arrives from the user and the system responds to it, the monitor evaluates the appropriateness of the response turn by turn. We can turn to a large variety of dialogue evaluation models that have been applied to linguistic input [8,10,11], which decide whether the response from the system is appropriate enough. In the current prototype, we fine-tune a large-scale pre-trained model, BERT [12], with appropriateness labels annotated in our previous study [2], where each system response was annotated with a binary label: appropriate or not.…”
Section: Dialogue Monitoring for Detecting Breakdown (mentioning)
confidence: 99%
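
The passage above describes fine-tuning BERT with binary appropriateness labels. A minimal sketch of that step is shown below, assuming a standard sequence-classification setup; the toy (context, response) pairs and the hyperparameters are placeholders, not the annotated data from the cited study.

```python
# Hedged sketch: fine-tune BERT to classify a (context, response) pair as
# appropriate (1) or not (0). Toy data and hyperparameters are assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # 0 = inappropriate, 1 = appropriate

pairs = [("Where do you live?", "I live in Kyoto.", 1),
         ("Where do you live?", "Bananas are yellow.", 0)]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for context, response, label in pairs:
    enc = tokenizer(context, response, return_tensors="pt",
                    truncation=True, max_length=128)
    loss = model(**enc, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```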
“…We find that the model trained with our negative samples alongside random negative samples shows a higher correlation with human evaluations than models trained only on random negative samples, in experiments on two datasets (Zhao et al., 2020). We also find evidence that automatic evaluation systems trained with the negative samples generated by our proposed method make decisions closer to human judgment than those trained without them.…”
Section: Introduction (mentioning)
confidence: 53%
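
The random negative samples referred to above are typically built by pairing a dialogue context with a response drawn from a different dialogue. A small sketch of that baseline construction follows; the dialogue list is invented for illustration, and the citing paper's own negative-sample generation method is not reproduced here.

```python
# Baseline "random negative" construction: a context keeps label 1 with its
# own response and gets label 0 when paired with a response from another
# dialogue. The example dialogues are placeholders.
import random

dialogues = [
    ("How was the movie?", "It was fantastic, you should see it."),
    ("What time is the meeting?", "It starts at ten tomorrow."),
    ("Do you like sushi?", "Yes, especially salmon."),
]

def random_negatives(dialogues, seed=0):
    rng = random.Random(seed)
    samples = []
    for i, (context, _) in enumerate(dialogues):
        j = rng.choice([k for k in range(len(dialogues)) if k != i])
        samples.append((context, dialogues[j][1], 0))  # mismatched pair, label 0
    return samples

positives = [(context, response, 1) for context, response in dialogues]
training_data = positives + random_negatives(dialogues)
```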
“…To measure the correlation between model predictions and human evaluations, we use the response-evaluation dataset proposed by Zhao et al. (2020). The dataset contains dialogue histories, machine-generated responses, golden responses, and appropriateness scores assigned by human annotators.…”
Section: Dataset (mentioning)
confidence: 99%
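
Measuring the correlation described above comes down to comparing per-response evaluator scores with human appropriateness ratings. A short sketch, assuming Pearson and Spearman as the correlation measures and using dummy score arrays:

```python
# Compare evaluator predictions with human appropriateness ratings.
# The score arrays below are dummy values for illustration only.
from scipy.stats import pearsonr, spearmanr

model_scores = [0.91, 0.35, 0.62, 0.10, 0.78]   # evaluator predictions
human_scores = [5, 2, 4, 1, 4]                  # human appropriateness ratings

pearson, _ = pearsonr(model_scores, human_scores)
spearman, _ = spearmanr(model_scores, human_scores)
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```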