Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.126
Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Abstract: Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance among different users. To allevia…
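To make the reference-based limitation concrete, the following is a minimal sketch (not from the paper) that scores two candidate responses against a small reference set with NLTK's sentence_bleu; the example sentences and the choice of NLTK are assumptions for illustration only.

```python
# Minimal sketch (not from the paper) of reference-based evaluation with BLEU,
# illustrating the limitation described in the abstract: a sensible response
# that happens not to overlap the few available references scores near zero.
# Assumes NLTK is installed; the example sentences are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i like hiking on weekends".split(),
    "hiking is my favorite weekend activity".split(),
]
smooth = SmoothingFunction().method1

# Lexically close to a reference -> relatively high BLEU.
on_reference = "i like hiking on the weekend".split()
# Equally plausible reply with almost no word overlap -> BLEU near zero.
valid_but_different = "mostly i stay home and read novels".split()

for response in (on_reference, valid_but_different):
    score = sentence_bleu(references, response, smoothing_function=smooth)
    print(" ".join(response), "->", round(score, 3))
```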

Cited by 26 publications (21 citation statements) | References 43 publications
“…However, self-reported ratings suffer from bias and variance among different users. Denoising human ratings is still an open research problem (Liang et al., 2020e; …).…”
Section: Open-Domain Dialog System Evaluation (mentioning)
confidence: 99%
“…A major bottleneck of these methods is that they require hand-labeling many dialog samples for individual datasets. Although Liang et al. (2020e) denoise user self-reported ratings with the Shapley algorithm for dialog system evaluation, their method cannot be directly applied to dialogs without user ratings, as in our setting. Our work focuses on the problem that user ratings are expensive and difficult to obtain.…”
Section: User Engagement In Dialogs (mentioning)
confidence: 99%
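The quote above mentions Shapley-based denoising only in passing; as a rough illustration, the sketch below estimates Shapley-style values for individual ratings via permutation sampling. The value function (agreement with an assumed per-dialog consensus score) is a stand-in, not the cited paper's actual formulation.

```python
# Hypothetical Monte Carlo sketch of Shapley-style valuation of self-reported
# ratings, in the spirit of the Shapley-based denoising the quote attributes to
# Liang et al. (2020e). The value function below (agreement with an assumed
# consensus score per dialog) is a stand-in, not the cited paper's formulation.
import random

def value(subset, ratings, consensus):
    # Toy value: negative mean absolute error of the selected ratings
    # against a consensus score assumed to exist for illustration.
    if not subset:
        return 0.0
    return -sum(abs(ratings[i] - consensus[i]) for i in subset) / len(subset)

def shapley_estimates(ratings, consensus, n_perm=200, seed=0):
    # Standard permutation-sampling estimator of Shapley values:
    # average each rating's marginal contribution over random orderings.
    rng = random.Random(seed)
    idx = list(range(len(ratings)))
    shap = [0.0] * len(ratings)
    for _ in range(n_perm):
        rng.shuffle(idx)
        prefix, prev = [], 0.0
        for i in idx:
            prefix.append(i)
            cur = value(prefix, ratings, consensus)
            shap[i] += (cur - prev) / n_perm
            prev = cur
    return shap

# The outlier rating (index 3) should receive the lowest Shapley estimate
# and would be the first candidate to drop as noise.
ratings = [4.0, 4.5, 4.2, 1.0]
consensus = [4.1, 4.3, 4.0, 4.2]
print(shapley_estimates(ratings, consensus))
```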
“…Sandbank et al. (2018) present an approach for classifying low-quality conversations in commercial conversational assistants. Liang et al. (2020) argue against the feasibility of conversation-level quality prediction on a Likert scale and instead present a pairwise comparison model, using methods that compensate for the high noise in user scores. Choi et al. (2019) present methods for both predicting user satisfaction and detecting conversation breakdowns at the turn level.…”
Section: Related Work (mentioning)
confidence: 99%
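As a rough illustration of the pairwise-comparison alternative mentioned in the quote above, the sketch below ranks two systems by aggregate win rate under a stand-in comparator; the `compare` interface and the toy data are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the pairwise-comparison idea described in the quote:
# instead of regressing a noisy 1-5 Likert score per dialog, a comparator
# judges which of two dialogs is better, and systems are ranked by win rate.
# `compare` stands in for a trained comparison model; it is an assumption,
# not the interface of Liang et al. (2020).
from itertools import product
from typing import Callable, List

def win_rate(dialogs_a: List[str], dialogs_b: List[str],
             compare: Callable[[str, str], float]) -> float:
    # Fraction of cross-paired comparisons in which system A is preferred
    # (compare(a, b) > 0.5 means A's dialog is judged better).
    pairs = list(product(dialogs_a, dialogs_b))
    wins = sum(1 for a, b in pairs if compare(a, b) > 0.5)
    return wins / len(pairs)

# Placeholder comparator: prefer the longer dialog (a learned model in practice).
toy_compare = lambda a, b: 1.0 if len(a) > len(b) else 0.0

dialogs_a = ["user: hi | bot: hello! how was your day? | user: pretty good"]
dialogs_b = ["user: hi | bot: hi"]
print(win_rate(dialogs_a, dialogs_b, toy_compare))  # 1.0 on this toy data
```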