The First Workshop on Evaluations and Assessments of Neural Conversation Systems 2021
DOI: 10.18653/v1/2021.eancs-1.3

A Comprehensive Assessment of Dialog Evaluation Metrics

Abstract: Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensiv…

Cited by 40 publications (84 citation statements)
References 14 publications
“…The same story also happens for Flow score, a state-of-the-art metric in the DSTC9 dataset. This observation is consistent with study from previous work (Yeh et al, 2021).…”
Section: Results and Analysis (supporting)
confidence: 94%
“…The common practice to show the effectiveness of a dialogue evaluation metric is to calculate the Pearson, Spearman's, and Kendall correlation between human evaluation and the automatic evaluation (Mehri and Eskénazi, 2020;Yeh et al, 2021). Table 2 list the correlations between automatic metrics and human evaluation.…”
Section: Results and Analysis (mentioning)
confidence: 99%
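The practice described in the statement above — reporting Pearson, Spearman, and Kendall correlations between an automatic metric and human judgments — can be sketched as follows. This is an illustrative example, not code from the paper; the score lists are hypothetical and `scipy.stats` is assumed to be available.

```python
# Sketch: correlating an automatic dialog metric with human judgments,
# as commonly done when validating dialog evaluation metrics.
# The five score pairs below are invented for illustration only.
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]   # hypothetical metric outputs per response
human_scores  = [1.0, 2.0, 2.5, 4.0, 4.5]   # hypothetical human ratings (e.g. 1-5 scale)

pearson_r,  _ = pearsonr(metric_scores, human_scores)    # linear correlation
spearman_rho, _ = spearmanr(metric_scores, human_scores) # rank correlation
kendall_tau, _ = kendalltau(metric_scores, human_scores) # pairwise-order agreement

print(f"Pearson r={pearson_r:.3f}, Spearman rho={spearman_rho:.3f}, "
      f"Kendall tau={kendall_tau:.3f}")
```

Because the toy scores are perfectly monotone, the rank-based coefficients come out at 1.0 while Pearson reflects how linear the relationship is; real metric evaluations report all three because each captures a different notion of agreement.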
“…Many researches on chatbot assessment are usually concerned about the local and technical metrics (e.g. fluency, diversity, interesting, informative, etc) (Mehri and Eskénazi, 2020a;Yeh et al, 2021). Under these criteria, chatbots can provide useful, interesting, and informative responses in online interactions with humans.…”
Section: Introduction (mentioning)
confidence: 99%