Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.4

Designing Precise and Robust Dialogue Response Evaluators

Abstract: Automatic dialogue response evaluators have been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only a moderate correlation with human judgement and are not robust. In this work, we propose to build a reference-free evaluator and to exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and gen…
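
The abstract describes a reference-free evaluator built on a pretrained (masked) language model. Below is a minimal sketch of what such an evaluator could look like; the model name (bert-base-uncased), the sigmoid regression head, and the example utterances are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical reference-free evaluator: a pretrained masked language model
# encodes the (context, response) pair and a small regression head maps the
# pooled representation to an appropriateness score. No golden response is
# needed, which is what makes the evaluator reference-free.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ReferenceFreeEvaluator(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.scorer = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]      # [CLS] pair representation
        return self.scorer(cls).squeeze(-1)    # appropriateness score in (0, 1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
evaluator = ReferenceFreeEvaluator()
batch = tokenizer("How was your weekend?",               # dialogue context
                  "Great, I went hiking with friends.",  # candidate response
                  return_tensors="pt")
with torch.no_grad():
    score = evaluator(**batch)
print(float(score))  # the head is untrained here, so the value is not yet meaningful
```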

Cited by 33 publications (39 citation statements). References 7 publications.
“…At the linguistic level, once speech arrives from the user and the system responds to it, the monitor evaluates the appropriateness of the response turn by turn. We can turn to a large variety of dialogue evaluation models that have been applied to linguistic input [8,10,11], which decide whether the response from the system is appropriate enough. In the current prototype, we fine-tune a large-scale pre-trained model, BERT [12], with appropriateness labels annotated in our previous study [2], where each system response was annotated with a binary label: appropriate or not.…”
Section: Dialogue Monitoring for Detecting Breakdown (mentioning)
confidence: 99%
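
The passage above describes fine-tuning BERT with binary appropriateness labels. A minimal sketch of that step is shown below, assuming a standard sequence-classification setup; the toy (context, response) pairs and the hyperparameters are placeholders, not the annotated data from the cited study.

```python
# Hedged sketch: fine-tune BERT to classify a (context, response) pair as
# appropriate (1) or not (0). Toy data and hyperparameters are assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # 0 = inappropriate, 1 = appropriate

pairs = [("Where do you live?", "I live in Kyoto.", 1),
         ("Where do you live?", "Bananas are yellow.", 0)]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for context, response, label in pairs:
    enc = tokenizer(context, response, return_tensors="pt",
                    truncation=True, max_length=128)
    loss = model(**enc, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```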
“…We find that the model trained with our negative samples alongside random negative samples shows a higher correlation with human evaluations than models trained only on random negative samples, in experiments on two datasets (Zhao et al., 2020). We also find evidence that automatic evaluation systems trained with the negative samples generated by our proposed method make decisions closer to human judgment than those trained without them.…”
Section: Introduction (mentioning)
confidence: 53%
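
The random negative samples referred to above are typically built by pairing a dialogue context with a response drawn from a different dialogue. A small sketch of that baseline construction follows; the dialogue list is invented for illustration, and the citing paper's own negative-sample generation method is not reproduced here.

```python
# Baseline "random negative" construction: a context keeps label 1 with its
# own response and gets label 0 when paired with a response from another
# dialogue. The example dialogues are placeholders.
import random

dialogues = [
    ("How was the movie?", "It was fantastic, you should see it."),
    ("What time is the meeting?", "It starts at ten tomorrow."),
    ("Do you like sushi?", "Yes, especially salmon."),
]

def random_negatives(dialogues, seed=0):
    rng = random.Random(seed)
    samples = []
    for i, (context, _) in enumerate(dialogues):
        j = rng.choice([k for k in range(len(dialogues)) if k != i])
        samples.append((context, dialogues[j][1], 0))  # mismatched pair, label 0
    return samples

positives = [(context, response, 1) for context, response in dialogues]
training_data = positives + random_negatives(dialogues)
```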
“…To measure the correlation between model predictions and human evaluations, we use the response-evaluation dataset proposed by Zhao et al. (2020). The dataset contains dialogue histories, machine-generated responses, golden responses, and appropriateness scores assigned by human annotators.…”
Section: Dataset (mentioning)
confidence: 99%
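
Measuring the correlation described above comes down to comparing per-response evaluator scores with human appropriateness ratings. A short sketch, assuming Pearson and Spearman as the correlation measures and using dummy score arrays:

```python
# Compare evaluator predictions with human appropriateness ratings.
# The score arrays below are dummy values for illustration only.
from scipy.stats import pearsonr, spearmanr

model_scores = [0.91, 0.35, 0.62, 0.10, 0.78]   # evaluator predictions
human_scores = [5, 2, 4, 1, 4]                  # human appropriateness ratings

pearson, _ = pearsonr(model_scores, human_scores)
spearman, _ = spearmanr(model_scores, human_scores)
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```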