Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8642

Agreement is overrated: A plea for correlation to assess human evaluation reliability

Abstract: Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular its reliability. According to existing scales of IAA interpretation (see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a)), most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human eva…

Cited by 30 publications (39 citation statements); references 41 publications. Citing publications span 2020–2024.

Citation statements (ordered by relevance):
“…Human judgments are often inconsistent for non-task-driven chatbots, since there is no clear objective, which leads to low inter-annotator agreement (IAA) (Sedoc et al., 2019; Yuwono et al., 2019). However, Amidei et al. (2019) point out that even with low IAA we can still find statistical significance. There are further tensions between local coherence assessments using standard evaluation sets and human interactive evaluation.…”
Section: Chatbot Evaluation (mentioning)
confidence: 79%
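The excerpt above makes the point that low IAA can coexist with a statistically significant association between annotators. A minimal sketch of that contrast is given below, assuming scikit-learn and SciPy and using invented ratings; it illustrates the general idea rather than reproducing any analysis from the cited papers.

```python
# Minimal sketch: low chance-corrected agreement can coexist with a
# strong, significant rank correlation. The ratings are invented for
# illustration only (not data from any of the cited papers).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Two hypothetical annotators rating 12 system outputs on a 1-5 scale.
# Annotator B is systematically about one point more generous than A.
ratings_a = [1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3]
ratings_b = [2, 3, 3, 4, 4, 5, 5, 5, 4, 2, 3, 4]

# Unweighted Cohen's kappa only rewards exact matches, so the
# systematic one-point offset drives it close to zero or below.
kappa = cohen_kappa_score(ratings_a, ratings_b)

# Spearman's rho looks at how the items are ranked instead, and here
# the two annotators order the outputs almost identically.
rho, p_value = spearmanr(ratings_a, ratings_b)

print(f"Cohen's kappa:  {kappa:.2f}")
print(f"Spearman's rho: {rho:.2f} (p = {p_value:.3g})")
```

With ratings like these, unweighted κ falls in the low bands of the Landis and Koch (1977) scale even though the annotators rank the outputs almost identically, which is close to the situation Amidei et al. (2019) argue should be diagnosed with correlation rather than agreement alone.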
“…The robustness of the evaluation of chatbots is often hampered by inter-annotator agreement (IAA) (Gandhe and Traum, 2016). Measuring and reporting IAA is not yet a standard practice in evaluating chatbots (Amidei et al., 2019a), and producing annotations with high IAA on open-domain conversations is prone to be impeded by subjective interpretation of feature definitions and idiosyncratic annotator behavior (Bishop and Herron, 2015). In our setting, annotator disagreement on a bot's human-like behavior can be interpreted as a feature of a bot's performance: a bot that manages to fool one of two annotators into believing it is human can be said to have performed better than a bot that does not manage to fool any annotator.…”
Section: On Inter-annotator Agreement (mentioning)
confidence: 99%
“…The instructions to be followed by annotators are often chosen ad hoc and there are no unified definitions. Compounded with the use of often-criticized Likert scales (Amidei et al., 2019a), these evaluations often yield low agreement. The required cost and time also inhibit the widespread use of such evaluations, which raises questions about the replicability, robustness, and thus significance of the results.…”
Section: Introduction (mentioning)
confidence: 99%
“…Further, Cohen's κ scores show that there is substantial (0.6–0.8] or almost perfect (0.8–1.0] agreement between experts for all measures except NR, PU, and SI, which show only weak agreement (0.40–0.59) (Landis and Koch, 1977). We also calculated Krippendorff's α, which is technically a measure of evaluator disagreement rather than agreement and the most common of the measures in the set of NLG papers surveyed by Amidei et al. (2019). The Krippendorff's α scores for all the other measures are good [0.8–1.0], except for the PO and SI measures, which are tentative [0.67–0.8), and the PU measure, which should be discarded because it is 0.04 below the 0.67 threshold (Krippendorff, 1980).…”
Section: Comparing Crowd With Expert (mentioning)
confidence: 99%
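The excerpt above applies two common interpretation scales: the Landis and Koch (1977) bands for Cohen's κ and Krippendorff's 0.67/0.8 benchmarks for α. A minimal sketch of the α computation and threshold check is shown below, assuming the `krippendorff` PyPI package and NumPy and using invented ratings; it is not the setup of the cited study.

```python
# Minimal sketch of a Krippendorff's alpha threshold check, assuming
# the `krippendorff` PyPI package and NumPy. The ratings are invented.
import numpy as np
import krippendorff

# Rows = annotators, columns = rated items; np.nan marks missing ratings.
reliability_data = np.array([
    [3, 4, 2, 5, 1, 4, np.nan, 3],
    [3, 4, 3, 5, 1, 4, 2,      3],
    [4, 4, 2, 5, 2, 5, 2,      np.nan],
])

# An ordinal level of measurement suits Likert-style rating scales.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")

# Benchmarks cited in the excerpt (Krippendorff, 1980):
# alpha >= 0.8 -> good; 0.67 <= alpha < 0.8 -> tentative; below 0.67 -> discard.
if alpha >= 0.8:
    verdict = "good"
elif alpha >= 0.67:
    verdict = "tentative"
else:
    verdict = "should be discarded"

print(f"Krippendorff's alpha = {alpha:.2f} ({verdict})")
```

The same matrix layout (annotators as rows, items as columns, np.nan for missing values) extends directly to the partially overlapping annotation designs that are common in crowdsourced NLG evaluation.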