Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8642

Agreement is overrated: A plea for correlation to assess human evaluation reliability

Abstract: Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular its reliability. According to existing scales of IAA interpretation (see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a)), most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human eva…

Cited by 30 publications (39 citation statements); references 41 publications. Citing publications span 2020–2024.

Citation statements (ordered by relevance):
“…Human judgments are often inconsistent for non-task-driven chatbots, since there is no clear objective, which leads to low inter-annotator agreement (IAA) (Sedoc et al., 2019; Yuwono et al., 2019). However, Amidei et al. (2019) point out that even with low IAA we can still find statistical significance. There are further tensions between local coherence assessments using standard evaluation sets and human interactive evaluation.…”
Section: Chatbot Evaluation (mentioning)
confidence: 79%
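The excerpt above makes the point that low IAA can coexist with a statistically significant association between annotators. A minimal sketch of that contrast is given below, assuming scikit-learn and SciPy and using invented ratings; it illustrates the general idea rather than reproducing any analysis from the cited papers.

```python
# Minimal sketch: low chance-corrected agreement can coexist with a
# strong, significant rank correlation. The ratings are invented for
# illustration only (not data from any of the cited papers).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Two hypothetical annotators rating 12 system outputs on a 1-5 scale.
# Annotator B is systematically about one point more generous than A.
ratings_a = [1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3]
ratings_b = [2, 3, 3, 4, 4, 5, 5, 5, 4, 2, 3, 4]

# Unweighted Cohen's kappa only rewards exact matches, so the
# systematic one-point offset drives it close to zero or below.
kappa = cohen_kappa_score(ratings_a, ratings_b)

# Spearman's rho looks at how the items are ranked instead, and here
# the two annotators order the outputs almost identically.
rho, p_value = spearmanr(ratings_a, ratings_b)

print(f"Cohen's kappa:  {kappa:.2f}")
print(f"Spearman's rho: {rho:.2f} (p = {p_value:.3g})")
```

With ratings like these, unweighted κ falls in the low bands of the Landis and Koch (1977) scale even though the annotators rank the outputs almost identically, which is close to the situation Amidei et al. (2019) argue should be diagnosed with correlation rather than agreement alone.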
“…The robustness of the evaluation of chatbots is often hampered by inter-annotator agreement (IAA) (Gandhe and Traum, 2016). Measuring and reporting IAA is not yet a standard practice in evaluating chatbots (Amidei et al., 2019a), and producing annotations with high IAA on open-domain conversations is prone to be impeded by subjective interpretation of feature definitions and idiosyncratic annotator behavior (Bishop and Herron, 2015). In our setting, annotator disagreement on a bot's human-like behavior can be interpreted as a feature of a bot's performance: a bot that manages to fool one of two annotators into believing it is human can be said to have performed better than a bot that does not manage to fool any annotator.…”
Section: On Inter-annotator Agreement (mentioning)
confidence: 99%
“…The instructions to be followed by annotators are often chosen ad hoc and there are no unified definitions. Compounded with the use of often-criticized Likert scales (Amidei et al., 2019a), these evaluations often yield low agreement. The required cost and time also inhibit the widespread use of such evaluations, which raises questions about the replicability, robustness, and thus significance of the results.…”
Section: Introduction (mentioning)
confidence: 99%
“…Further, Cohen's κ scores show that there is substantial (0.6–0.8] or almost perfect (0.8–1.0] agreement between experts for all measures except NR, PU, and SI, which show only weak agreement (0.40–0.59) (Landis and Koch, 1977). We also calculated Krippendorff's α, which is technically a measure of evaluator disagreement rather than agreement and the most common of the measures in the set of NLG papers surveyed by Amidei et al. (2019). The Krippendorff's α scores for all the other measures are good [0.8–1.0], except for the PO and SI measures, which are tentative [0.67–0.8), and the PU measure, which should be discarded because it is 0.04 below the 0.67 threshold (Krippendorff, 1980).…”
Section: Comparing Crowd With Expert (mentioning)
confidence: 99%
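The excerpt above applies two common interpretation scales: the Landis and Koch (1977) bands for Cohen's κ and Krippendorff's 0.67/0.8 benchmarks for α. A minimal sketch of the α computation and threshold check is shown below, assuming the `krippendorff` PyPI package and NumPy and using invented ratings; it is not the setup of the cited study.

```python
# Minimal sketch of a Krippendorff's alpha threshold check, assuming
# the `krippendorff` PyPI package and NumPy. The ratings are invented.
import numpy as np
import krippendorff

# Rows = annotators, columns = rated items; np.nan marks missing ratings.
reliability_data = np.array([
    [3, 4, 2, 5, 1, 4, np.nan, 3],
    [3, 4, 3, 5, 1, 4, 2,      3],
    [4, 4, 2, 5, 2, 5, 2,      np.nan],
])

# An ordinal level of measurement suits Likert-style rating scales.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")

# Benchmarks cited in the excerpt (Krippendorff, 1980):
# alpha >= 0.8 -> good; 0.67 <= alpha < 0.8 -> tentative; below 0.67 -> discard.
if alpha >= 0.8:
    verdict = "good"
elif alpha >= 0.67:
    verdict = "tentative"
else:
    verdict = "should be discarded"

print(f"Krippendorff's alpha = {alpha:.2f} ({verdict})")
```

The same matrix layout (annotators as rows, items as columns, np.nan for missing values) extends directly to the partially overlapping annotation designs that are common in crowdsourced NLG evaluation.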