2020
DOI: 10.1609/aaai.v34i05.6283

Predictive Engagement: An Efficient Metric for Automatic Evaluation of Open-Domain Dialogue Systems

Abstract: User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human…
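The abstract describes moving from heuristic conversation-level features to utterance-level engagement estimates. The following is a minimal sketch of that idea, assuming utterance-level scores are produced by some learned classifier and aggregated into a conversation-level score by simple averaging; the `engagement_score` stub and the toy heuristic inside it are illustrative placeholders, not the paper's model.

```python
# Sketch: utterance-level engagement scores aggregated to a conversation-level
# score. A real system would replace `engagement_score` with a trained
# classifier over (query, reply) representations.
from statistics import mean


def engagement_score(query: str, reply: str) -> float:
    """Placeholder utterance-level engagement scorer in [0, 1]."""
    # Toy heuristic: longer, non-generic replies score higher.
    generic = {"i don't know.", "ok.", "yes."}
    if reply.lower().strip() in generic:
        return 0.05
    return min(len(reply.split()) / 20.0, 1.0)


def conversation_engagement(turns: list[tuple[str, str]]) -> float:
    """Conversation-level engagement as the mean of utterance-level scores."""
    return mean(engagement_score(q, r) for q, r in turns)


dialogue = [
    ("Do you like jazz?", "I love it, especially live shows at small clubs."),
    ("Who is your favourite artist?", "I don't know."),
]
print(f"conversation engagement: {conversation_engagement(dialogue):.2f}")
```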

Cited by 32 publications (50 citation statements). References 18 publications (44 reference statements).
“…Previous research also confirms that incorporating user engagement as real-time feedback benefits dialog policy learning (Yu et al., 2016). One of the most costly bottlenecks of learning to detect user disengagement is to annotate many turn-level user engagement labels (Ghazarian et al., 2020). In addition, the data annotation process becomes more expensive and challenging for privacy-sensitive dialog corpora, due to the privacy concerns in crowdsourcing (Xia and McKernan, 2020).…”
Section: Introduction (mentioning)
confidence: 61%
“…Example proxy metrics include conversation length, such as the number of dialog turns, and conversational breadth, such as topical diversity. Sporadic attempts have been made at detecting user disengagement in dialogs (Yu et al., 2004; Ghazarian et al., 2020; Choi et al., 2019). A major bottleneck of these methods is that they require hand-labeling many dialog samples for individual datasets.…”
Section: User Engagement in Dialogs (mentioning)
confidence: 99%
“…We use the contextualized Ruber metric for this purpose (Ghazarian et al., 2019). Finally, since open-domain dialogue systems need responses that are both relevant and interesting to satisfy the user (Ghazarian et al., 2020), we further validate systems based on the engagingness of responses. We compute engagingness as the probability score of the engaging class predicted by Ghazarian et al. (2020)'s proposed engagement classifier.…”
Section: Automatic Evaluations (mentioning)
confidence: 99%
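The quotation above scores each response by the predicted probability of the "engaging" class from an engagement classifier. Below is a minimal, self-contained sketch of that scoring flow under stated assumptions: the bag-of-trigram `embed` function, the tiny training set, and the logistic-regression head are illustrative stand-ins, not the authors' released classifier, which is built on contextualized (BERT-style) embeddings.

```python
# Sketch: engagingness = P(engaging class | response) from a binary classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression


def embed(text: str) -> np.ndarray:
    """Toy hashed character-trigram embedding; a real setup would use
    contextualized sentence embeddings instead."""
    vec = np.zeros(256)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)


# Tiny illustrative training set: 1 = engaging, 0 = not engaging.
texts = [
    "That sounds amazing, tell me more about your trip!",
    "I also love hiking; last summer I climbed three peaks.",
    "ok",
    "I don't know.",
]
labels = [1, 1, 0, 0]

clf = LogisticRegression().fit(np.stack([embed(t) for t in texts]), labels)


def engagingness(response: str) -> float:
    """Probability assigned to the engaging class (label 1)."""
    return float(clf.predict_proba(embed(response)[None, :])[0, 1])


print(engagingness("Great question! I'd pick Coltrane; his live albums are wild."))
```

The design choice mirrored here is simply reading the positive-class probability off the classifier rather than a hard label, so the score can be averaged or ranked across systems.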
“…They are able to assess language fluency and context coherence of dialogue responses to a certain extent. However, they are limited when it comes to systematically assessing other aspects such as logical consistency, semantic appropriateness [11], and user engagement [12]. For machine evaluation to approach human performance, we identify two research problems: a) to define evaluation metrics that describe different dialogue aspects, and b) to establish a holistic solution that considers the inter-dependence of the different aspects.…”
Section: Introduction (mentioning)
confidence: 99%