Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.64

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Abstract: The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42,…

Cited by 110 publications (224 citation statements)
References 28 publications
“…That is, formally, our method is similar to the reference-free automatic evaluation metrics for dialogue agents; both evaluate a response given an input utterance and map it to a score. Recently, novel reference-free metrics for evaluating generated responses, such as USR (Mehri and Eskenazi, 2020) or MAUDE (Sinha et al., 2020), were developed.…”
Section: Relationship With Evaluation Metric
confidence: 99%
“…Pang et al. [80] proposed using the GPT-2 model as the standard to automatically measure the quality of generated responses, including context coherency, response fluency and diversity, and logical self-consistency. Mehri and Eskenazi [81] proposed an unsupervised automatic evaluation method with fewer references. They used RoBERTa to automatically measure the quality of the generated responses, and found that the results correlate highly with human evaluation.…”
Section: Ubuntu
confidence: 99%
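As a rough illustration of this style of language-model-based scoring (a minimal sketch, not the cited authors' code; the checkpoint name and the helper function are assumptions), the following Python snippet uses a pretrained GPT-2 to compute response perplexity as a simple fluency proxy:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def response_perplexity(response: str) -> float:
    # Perplexity of the response under GPT-2: lower values suggest more
    # fluent text. Hypothetical helper, not part of the cited methods.
    enc = tokenizer(response, return_tensors="pt")
    with torch.no_grad():
        # Supplying labels makes the model return the mean token cross-entropy.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(response_perplexity("That sounds great, I would love to join you!"))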
“…Metric for Specificity. For simplicity of studying the configurability of our proposed metric, we select specificity as our likable quality. Following the use of RoBERTa in Mehri and Eskenazi (2020) to compute the masked language model (MLM) metric, we use a BERT-based model for consistency with the BERT-VUP and BERT-NUP metrics. Moreover, instead of using both (c, r), as in Mehri and Eskenazi (2020), we only use the response r to ensure independence from the context c. Therefore, for a response r with m words, we sequentially mask one word at a time and feed it into BERT-MLM to predict the negative log-likelihood (MLM-Likelihood) of all masked words.…”
Section: Metrics For Fundamental Aspects
confidence: 99%
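A minimal sketch of this response-only MLM-Likelihood computation (an assumption of how it could be implemented, not the cited paper's code; the checkpoint name and helper function are hypothetical): mask each token of the response in turn, score it with a pretrained BERT masked LM, and average the negative log-likelihoods.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mlm_likelihood(response: str) -> float:
    # Average negative log-likelihood of the response tokens, masking one
    # token at a time and using only the response (no dialogue context).
    enc = tokenizer(response, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            target = masked[i].item()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[target].item())
    return sum(nlls) / max(len(nlls), 1)

print(mlm_likelihood("I usually go hiking in the mountains on weekends."))

Under this reading, a lower average negative log-likelihood indicates a response whose words are individually predictable from their surrounding response tokens alone, independent of the dialogue context.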