2022
DOI: 10.48550/arxiv.2203.13927
Preprint

What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation

Abstract: Accurate automatic evaluation metrics for open-domain dialogs are in high demand. Existing model-based metrics for system response evaluation are trained on human annotated data, which is cumbersome to collect. In this work, we propose to use information that can be automatically extracted from the next user utterance, such as its sentiment or whether the user explicitly ends the conversation, as a proxy to measure the quality of the previous system response. This allows us to train on a massive set of dialogs…
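As a minimal sketch of the proxy signal the abstract describes, the following snippet scores a system response by the sentiment of the user utterance that follows it. The classifier checkpoint, the score mapping, and the example utterances are illustrative assumptions, not the paper's actual training setup.

    # Sketch: rate the preceding system response by the sentiment of the
    # next user utterance. Model and mapping are assumptions for illustration.
    from transformers import pipeline

    sentiment = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )

    def proxy_score(next_user_utterance: str) -> float:
        """Return a quality proxy in [0, 1] for the previous system response."""
        result = sentiment(next_user_utterance)[0]
        prob = result["score"]
        # Positive sentiment maps to high scores, negative to low scores.
        return prob if result["label"] == "POSITIVE" else 1.0 - prob

    # A frustrated follow-up suggests the system response was poor.
    print(proxy_score("What is wrong with you? That makes no sense."))
    print(proxy_score("That's really helpful, thanks!"))

Because such labels come for free from dialog logs, a scorer trained this way needs no human annotation, which is the point the abstract makes about training on a massive set of dialogs.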

Cited by 1 publication (5 citation statements)
References 14 publications
“…Another idea is to score a proposed system utterance by the probability that it elicits a particular user response type, such as disinterest or criticism [46,31,49,12].…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, the FED framework scores proposed utterances by utilizing the DialoGPT LM probabilities for the subsequent user utterances such as "That is interesting" [56,31]. Predictive models can also be generated from large dialogue corpora by using the following user utterance as weak supervision [12,46].…”
Section: Related Work (mentioning)
confidence: 99%
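The FED-style scoring this citation statement describes can be sketched roughly as below: compute the DialoGPT log-probability of a positive follow-up such as "That is interesting" given the dialog context. The checkpoint, the single follow-up phrase, and the token-averaged scoring are assumptions for illustration; FED itself aggregates likelihoods over several positive and negative follow-up utterances per quality dimension.

    # Sketch: score a dialog context by how likely DialoGPT is to continue it
    # with a positive user reaction. Checkpoint and phrase are assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
    model.eval()

    def followup_logprob(context: str, followup: str) -> float:
        """Average log-probability of `followup` tokens given `context`."""
        ctx_ids = tokenizer.encode(context + tokenizer.eos_token,
                                   return_tensors="pt")
        fol_ids = tokenizer.encode(followup + tokenizer.eos_token,
                                   return_tensors="pt")
        input_ids = torch.cat([ctx_ids, fol_ids], dim=-1)
        # Mask the context tokens so the loss covers only the follow-up.
        labels = input_ids.clone()
        labels[:, : ctx_ids.shape[-1]] = -100
        with torch.no_grad():
            loss = model(input_ids, labels=labels).loss  # mean NLL
        return -loss.item()

    # Higher is better: the context more plausibly elicits the reaction.
    print(followup_logprob("I just got back from a trip to Iceland.",
                           "That is interesting!"))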