The First Workshop on Evaluations and Assessments of Neural Conversation Systems 2021
DOI: 10.18653/v1/2021.eancs-1.3

A Comprehensive Assessment of Dialog Evaluation Metrics

Abstract: Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensiv…

Cited by 40 publications (84 citation statements)
References 14 publications
“…The same story also happens for Flow score, a state-of-the-art metric in the DSTC9 dataset. This observation is consistent with study from previous work (Yeh et al, 2021).…”
Section: Results and Analysis (supporting)
confidence: 94%
“…The common practice to show the effectiveness of a dialogue evaluation metric is to calculate the Pearson, Spearman's, and Kendall correlation between human evaluation and the automatic evaluation (Mehri and Eskénazi, 2020;Yeh et al, 2021). Table 2 list the correlations between automatic metrics and human evaluation.…”
Section: Results and Analysis (mentioning)
confidence: 99%
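The practice described in the statement above — reporting Pearson, Spearman, and Kendall correlations between an automatic metric and human judgments — can be sketched as follows. This is an illustrative example, not code from the paper; the score lists are hypothetical and `scipy.stats` is assumed to be available.

```python
# Sketch: correlating an automatic dialog metric with human judgments,
# as commonly done when validating dialog evaluation metrics.
# The five score pairs below are invented for illustration only.
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]   # hypothetical metric outputs per response
human_scores  = [1.0, 2.0, 2.5, 4.0, 4.5]   # hypothetical human ratings (e.g. 1-5 scale)

pearson_r,  _ = pearsonr(metric_scores, human_scores)    # linear correlation
spearman_rho, _ = spearmanr(metric_scores, human_scores) # rank correlation
kendall_tau, _ = kendalltau(metric_scores, human_scores) # pairwise-order agreement

print(f"Pearson r={pearson_r:.3f}, Spearman rho={spearman_rho:.3f}, "
      f"Kendall tau={kendall_tau:.3f}")
```

Because the toy scores are perfectly monotone, the rank-based coefficients come out at 1.0 while Pearson reflects how linear the relationship is; real metric evaluations report all three because each captures a different notion of agreement.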
“…Many researches on chatbot assessment are usually concerned about the local and technical metrics (e.g. fluency, diversity, interesting, informative, etc) (Mehri and Eskénazi, 2020a;Yeh et al, 2021). Under these criteria, chatbots can provide useful, interesting, and informative responses in online interactions with humans.…”
Section: Introduction (mentioning)
confidence: 99%