Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue 2019
DOI: 10.18653/v1/w19-5944

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

Abstract: The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time-consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments s…

Cited by 52 publications (55 citation statements)
References 32 publications

“…A well-known reason is that these automatic dialog evaluation metrics rely on modeling the distance between the generated response and a limited number of references available. The fundamental gap between the open-ended nature of the conversations and the limited references (Gupta et al., 2019) is not addressed in methods that are lexical-level based (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005), embedding based (Rus and Lintean, 2012; Forgues et al., 2014), or learning based (Tao et al., 2018; Lowe et al., 2017).…”
Section: Related Work
confidence: 99%
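To make the limited-reference problem concrete, here is a minimal sketch of multi-reference overlap scoring, assuming NLTK is installed; the function name and example strings are illustrative and not drawn from any of the cited papers. NLTK's sentence_bleu already accepts several references at once, so a valid response only has to match the closest one rather than a single gold reply.

```python
# Minimal sketch: scoring a generated response against multiple human
# references. All names and example strings here are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def multi_reference_bleu(response, references):
    """Sentence-level BLEU computed against several references at once.

    sentence_bleu clips n-gram counts against the best-matching reference,
    so adding references can only help a response that is sensible but
    worded differently from any single gold reply.
    """
    hypothesis = response.split()
    refs = [r.split() for r in references]
    smooth = SmoothingFunction().method1  # avoid zero scores on short replies
    return sentence_bleu(refs, hypothesis, smoothing_function=smooth)

references = [
    "sure , what time works for you ?",
    "sounds good , when should we meet ?",
    "okay . where do you want to go ?",
]
print(multi_reference_bleu("sure , when do you want to meet ?", references))
```

Embedding-based and learned metrics can be extended along the same lines, for example by taking the maximum similarity over the reference set instead of the maximum n-gram overlap.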
“…For image captioning, we use English Flickr30k (Young et al., 2014), and the STAIR (Yoshikawa et al., 2017) dataset which provides Japanese captions for COCO images. We also explore the multi-turn dialogue dataset DailyDialog (Li et al., 2017), which contains conversations that cover 10 different daily life topics, and its multi-reference test set (Gupta et al., 2019). Table 4 summarises the statistics about the datasets explored for this experiment.…”
Section: Single Representative Sentence
confidence: 99%
“…They measured the correlation, using a regression-based approach, between systems' responses and a large set of both positive and negative human references. Gupta et al. (2019) extended the test split of DailyDialog (1k dialogues) with multiple references. They compared the results of using single-reference versus multiple-reference data.…”
Section: Dialogue Evaluation
confidence: 99%
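As a rough illustration of the single- versus multi-reference comparison described above, the sketch below correlates metric scores with human ratings under both settings using SciPy; the score lists are placeholders for illustration only, not data or results from Gupta et al. (2019).

```python
# Minimal sketch: comparing how well a metric tracks human judgement when
# scored against one reference versus many. All numbers are placeholders.
from scipy.stats import pearsonr, spearmanr

human_ratings     = [4.0, 2.0, 5.0, 1.0, 3.0]            # human quality ratings per response
single_ref_scores = [0.10, 0.05, 0.20, 0.02, 0.07]        # metric scored against one reference
multi_ref_scores  = [0.35, 0.08, 0.55, 0.03, 0.22]        # metric scored as max over many references

for name, scores in [("single-ref", single_ref_scores), ("multi-ref", multi_ref_scores)]:
    p, _ = pearsonr(human_ratings, scores)
    s, _ = spearmanr(human_ratings, scores)
    print(f"{name}: Pearson r = {p:.2f}, Spearman rho = {s:.2f}")
```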