Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.126
Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Abstract: Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance among different users. To allevia…
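To make the reference-based limitation concrete, the following is a minimal sketch (not from the paper) that scores two candidate responses against a small reference set with NLTK's sentence_bleu; the example sentences and the choice of NLTK are assumptions for illustration only.

```python
# Minimal sketch (not from the paper) of reference-based evaluation with BLEU,
# illustrating the limitation described in the abstract: a sensible response
# that happens not to overlap the few available references scores near zero.
# Assumes NLTK is installed; the example sentences are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i like hiking on weekends".split(),
    "hiking is my favorite weekend activity".split(),
]
smooth = SmoothingFunction().method1

# Lexically close to a reference -> relatively high BLEU.
on_reference = "i like hiking on the weekend".split()
# Equally plausible reply with almost no word overlap -> BLEU near zero.
valid_but_different = "mostly i stay home and read novels".split()

for response in (on_reference, valid_but_different):
    score = sentence_bleu(references, response, smoothing_function=smooth)
    print(" ".join(response), "->", round(score, 3))
```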

Cited by 26 publications (21 citation statements) | References 43 publications
“…However, self-reported ratings suffer from bias and variance among different users. Denoising human ratings is still an open research problem (Liang et al., 2020e; …).…”
Section: Open-Domain Dialog System Evaluation (mentioning)
confidence: 99%
“…A major bottleneck of these methods is that they require hand-labeling many dialog samples for individual datasets. Although Liang et al. (2020e) denoise user self-reported ratings with the Shapley algorithm for dialog system evaluation, their method cannot be directly applied to dialogs without user ratings, as in our setting. Our work focuses on the problem that user ratings are expensive and difficult to obtain.…”
Section: User Engagement In Dialogs (mentioning)
confidence: 99%
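The quote above mentions Shapley-based denoising only in passing; as a rough illustration, the sketch below estimates Shapley-style values for individual ratings via permutation sampling. The value function (agreement with an assumed per-dialog consensus score) is a stand-in, not the cited paper's actual formulation.

```python
# Hypothetical Monte Carlo sketch of Shapley-style valuation of self-reported
# ratings, in the spirit of the Shapley-based denoising the quote attributes to
# Liang et al. (2020e). The value function below (agreement with an assumed
# consensus score per dialog) is a stand-in, not the cited paper's formulation.
import random

def value(subset, ratings, consensus):
    # Toy value: negative mean absolute error of the selected ratings
    # against a consensus score assumed to exist for illustration.
    if not subset:
        return 0.0
    return -sum(abs(ratings[i] - consensus[i]) for i in subset) / len(subset)

def shapley_estimates(ratings, consensus, n_perm=200, seed=0):
    # Standard permutation-sampling estimator of Shapley values:
    # average each rating's marginal contribution over random orderings.
    rng = random.Random(seed)
    idx = list(range(len(ratings)))
    shap = [0.0] * len(ratings)
    for _ in range(n_perm):
        rng.shuffle(idx)
        prefix, prev = [], 0.0
        for i in idx:
            prefix.append(i)
            cur = value(prefix, ratings, consensus)
            shap[i] += (cur - prev) / n_perm
            prev = cur
    return shap

# The outlier rating (index 3) should receive the lowest Shapley estimate
# and would be the first candidate to drop as noise.
ratings = [4.0, 4.5, 4.2, 1.0]
consensus = [4.1, 4.3, 4.0, 4.2]
print(shapley_estimates(ratings, consensus))
```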
“…Sandbank et al. (2018) present an approach for classifying low-quality conversations in commercial conversational assistants. Liang et al. (2020) argue against the feasibility of conversation-level quality prediction on a Likert scale and instead present a pairwise comparison model, using methods that compensate for the high noise in user scores. Choi et al. (2019) present methods for both predicting user satisfaction and detecting conversation breakdowns at the turn level.…”
Section: Related Work (mentioning)
confidence: 99%
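As a rough illustration of the pairwise-comparison alternative mentioned in the quote above, the sketch below ranks two systems by aggregate win rate under a stand-in comparator; the `compare` interface and the toy data are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the pairwise-comparison idea described in the quote:
# instead of regressing a noisy 1-5 Likert score per dialog, a comparator
# judges which of two dialogs is better, and systems are ranked by win rate.
# `compare` stands in for a trained comparison model; it is an assumption,
# not the interface of Liang et al. (2020).
from itertools import product
from typing import Callable, List

def win_rate(dialogs_a: List[str], dialogs_b: List[str],
             compare: Callable[[str, str], float]) -> float:
    # Fraction of cross-paired comparisons in which system A is preferred
    # (compare(a, b) > 0.5 means A's dialog is judged better).
    pairs = list(product(dialogs_a, dialogs_b))
    wins = sum(1 for a, b in pairs if compare(a, b) > 0.5)
    return wins / len(pairs)

# Placeholder comparator: prefer the longer dialog (a learned model in practice).
toy_compare = lambda a, b: 1.0 if len(a) > len(b) else 0.0

dialogs_a = ["user: hi | bot: hello! how was your day? | user: pretty good"]
dialogs_b = ["user: hi | bot: hi"]
print(win_rate(dialogs_a, dialogs_b, toy_compare))  # 1.0 on this toy data
```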