Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.64

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Abstract: The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42,…

Cited by 110 publications (224 citation statements)
References 28 publications
“…That is, formally, our method is similar to the reference-free automatic evaluation metrics for dialogue agents; both evaluate a response given an input utterance and map it to a score. Recently, novel reference-free metrics for evaluating generated responses, such as USR (Mehri and Eskenazi, 2020) or MAUDE (Sinha et al., 2020), were developed.…”
Section: Relationship With Evaluation Metric
confidence: 99%
“…Pang et al. [80] proposed using the GPT-2 model as the standard to automatically measure the quality of generated responses, including context coherency, response fluency and diversity, and logical self-consistency. Mehri and Eskenazi [81] proposed an unsupervised automatic evaluation method with fewer references. They used RoBERTa to automatically measure the quality of the generated responses, and found that the results correlate highly with human evaluation.…”
Section: Ubuntu
confidence: 99%
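As a rough illustration of this style of language-model-based scoring (a minimal sketch, not the cited authors' code; the checkpoint name and the helper function are assumptions), the following Python snippet uses a pretrained GPT-2 to compute response perplexity as a simple fluency proxy:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def response_perplexity(response: str) -> float:
    # Perplexity of the response under GPT-2: lower values suggest more
    # fluent text. Hypothetical helper, not part of the cited methods.
    enc = tokenizer(response, return_tensors="pt")
    with torch.no_grad():
        # Supplying labels makes the model return the mean token cross-entropy.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(response_perplexity("That sounds great, I would love to join you!"))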
“…Metric for Specificity. For simplicity of studying the configurability of our proposed metric, we select specificity as our likable quality. Following the use of RoBERTa in Mehri and Eskenazi (2020) to compute the masked language model (MLM) metric, we use a BERT-based model for consistency with the BERT-VUP and BERT-NUP metrics. Moreover, instead of using both (c, r), as in Mehri and Eskenazi (2020), we only use the response r to ensure independence from the context c. Therefore, for a response r with m words, we sequentially mask one word at a time and feed it into BERT-MLM to predict the negative log-likelihood (MLM-Likelihood) of all masked words.…”
Section: Metrics For Fundamental Aspects
confidence: 99%
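A minimal sketch of this response-only MLM-Likelihood computation (an assumption of how it could be implemented, not the cited paper's code; the checkpoint name and helper function are hypothetical): mask each token of the response in turn, score it with a pretrained BERT masked LM, and average the negative log-likelihoods.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mlm_likelihood(response: str) -> float:
    # Average negative log-likelihood of the response tokens, masking one
    # token at a time and using only the response (no dialogue context).
    enc = tokenizer(response, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            target = masked[i].item()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[target].item())
    return sum(nlls) / max(len(nlls), 1)

print(mlm_likelihood("I usually go hiking in the mountains on weekends."))

Under this reading, a lower average negative log-likelihood indicates a response whose words are individually predictable from their surrounding response tokens alone, independent of the dialogue context.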