Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue 2022
DOI: 10.18653/v1/2022.sigdial-1.30
A Systematic Evaluation of Response Selection for Open Domain Dialogue

Behnam Hedayatnia,
Di Jin,
Yang Liu
et al.

Abstract: Recent progress on neural approaches for language processing has triggered a resurgence of interest in building intelligent open-domain chatbots. However, even the state-of-the-art neural chatbots cannot produce satisfying responses for every turn in a dialog. A practical solution is to generate multiple response candidates for the same context, and then perform response ranking/selection to determine which candidate is the best. Previous work in response selection typically trains response rankers using synth…

Cited by 1 publication (2 citation statements)
References 29 publications
“…In a study by Hedayatnia et al (2022), they demonstrated that using a human-chatbot dataset, where responses were generated by multiple response generators and then annotated by humans for training RS (response selection) models, led to improved performance compared to models trained on synthetically generated datasets. Unfortunately, the dataset they used could not be made public due to privacy concerns, as it contained real-user dialogs.…”
Section: Related Work (confidence: 99%)
“…These synthetically curated test sets are not sufficient representations of real-world inference time candidates that are generated by dialog models. Hedayatnia et al (2022) demonstrated the effectiveness of training on machine-generated candidates from real user interactions over using synthetic candidates for response selection. However this data is not publicly available.…”
Section: Introduction (confidence: 99%)