Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8608
Towards Coherent and Engaging Spoken Dialog Response Generation Using Automatic Conversation Evaluators

Abstract: Encoder-decoder based neural architectures serve as the basis of state-of-the-art approaches in end-to-end open-domain dialog systems. Since most such systems are trained with a maximum likelihood (MLE) objective, they suffer from issues such as lack of generalizability and the generic response problem, i.e., a system response that can be an answer to a large number of user utterances, e.g., "Maybe, I don't know." Having explicit feedback on the relevance and interestingness of a system response at each turn…
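The abstract's core idea, using evaluator feedback to steer response selection, can be illustrated with a minimal sketch. Everything below (the rerank_responses and penalize_generic names, the toy scoring rule) is a hypothetical illustration under assumed interfaces, not the authors' implementation:

from typing import Callable, List, Tuple

def rerank_responses(
    candidates: List[str],
    context: List[str],
    evaluator: Callable[[List[str], str], float],
) -> List[Tuple[str, float]]:
    # Score each candidate against the dialog context and sort so the
    # most relevant/engaging response comes first.
    scored = [(resp, evaluator(context, resp)) for resp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def penalize_generic(context: List[str], response: str) -> float:
    # Toy stand-in for a learned conversation evaluator: reward lexical
    # specificity, penalize stock generic phrases like "I don't know".
    score = float(len(set(response.lower().split())))
    if "i don't know" in response.lower():
        score -= 10.0
    return score

context = ["What did you think of the movie?"]
candidates = [
    "Maybe, I don't know.",
    "I loved the soundtrack, especially the opening theme.",
]
best, _ = rerank_responses(candidates, context, penalize_generic)[0]
print(best)  # prints the more specific, engaging candidate

In the paper's setting, a trained evaluator would replace the toy scoring function, giving explicit per-turn feedback on relevance and interestingness.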

Cited by 24 publications (22 citation statements) | References: 41 publications
“…Eventually, as Table 2 demonstrates, the mean κ agreement and mean Pearson correlation between evaluators participating in our experiments were 0.52 and 0.93. In the context of dialogue system evaluation, where agreement is usually quite low (Venkatesh et al. 2018; Ghandeharioun et al. 2019; Yi et al. 2019), these numbers show relatively high agreement between annotators. This provides evidence that engagement can be measured not only at the conversation level but also at the utterance level.…”
Section: Utterance-level Engagement Scores
confidence: 99%
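As a side note on the quoted statistics, agreement figures like κ = 0.52 and Pearson r = 0.93 are typically produced with standard library calls. A brief sketch with made-up ratings (not the paper's data, and Cohen's κ is only one common variant of the unspecified "mean κ"):

from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Made-up per-utterance engagement labels from two annotators; the
# paper's actual data and rating scale may differ.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
r, p = pearsonr(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}, Pearson r: {r:.2f} (p = {p:.3f})")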
“…Likability quantifies how much a set of one or more qualities makes a response more likable for a particular task. These qualities can be diversity (Li et al., 2016), sentiment (Rashkin et al., 2019), specificity (Ke et al., 2018), engagement (Yi et al., 2019), fluency (Kann et al., 2018), and more. A likable response may or may not be sensible to the context.…”
Section: Fundamental Aspects
confidence: 99%
“…In this way, dialog systems could detect and react to a user's disengagement in both open-domain dialogs (Yu et al., 2016) and task-oriented dialogs (Yu et al., 2017). During training, our model could also be used as real-time feedback to benefit dialog policy learning (Yi et al., 2019). Second, HERALD could quantify user engagement and be used as an automatic dialog evaluation metric.…”
Section: Introduction
confidence: 99%
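The per-turn feedback loop this excerpt describes can be sketched as follows; the engagement_model callable and the thresholding rule are placeholders for illustration, not HERALD's actual interface:

from typing import Callable, List

def turn_rewards(
    system_turns: List[str],
    contexts: List[List[str]],
    engagement_model: Callable[[List[str], str], float],
    threshold: float = 0.5,
    disengagement_penalty: float = -1.0,
) -> List[float]:
    # Reward each system turn with its predicted engagement score;
    # turns falling below the threshold receive a penalty, steering a
    # policy learner away from responses that disengage the user.
    rewards = []
    for context, turn in zip(contexts, system_turns):
        score = engagement_model(context, turn)
        rewards.append(score if score >= threshold else disengagement_penalty)
    return rewards

Averaging the same scores over a whole conversation would give the utterance-level model double duty as an automatic dialog evaluation metric, as the excerpt notes.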