Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems 2020
DOI: 10.18653/v1/2020.eval4nlp-1.3

Item Response Theory for Efficient Human Evaluation of Chatbots

Abstract: Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly wel…
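As background for the paired-comparison setup the abstract describes, here is a minimal sketch assuming a logistic (Bradley-Terry-style) link between latent system abilities; the paper's actual IRT parameterization, which also models per-item effects, is not reproduced here, and the ability values are made up.

```python
import numpy as np

def p_win(theta_a, theta_b):
    """Probability that system A's next-turn response is judged better
    than system B's, under a logistic link on latent abilities.
    An illustrative assumption, not the paper's exact model."""
    return 1.0 / (1.0 + np.exp(-(theta_a - theta_b)))

# Hypothetical abilities: a stronger system (0.5) vs. a weaker one (-0.2)
print(p_win(0.5, -0.2))  # ~0.668
```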

Cited by 12 publications (9 citation statements) | References 44 publications

“…In educational testing, collecting responses from humans is expensive; likewise, although questions are cheap in search-based QA tasks (Nguyen et al., 2016; Kwiatkowski et al., 2019), annotating answers is expensive. Similarly, "grading" machine dialog responses is expensive and IRT helps (Sedoc and Ungar, 2020). To emulate this setting, we use computerized adaptive testing (Weiss and Kingsbury, 1984) to iteratively select SQuAD items to "annotate.…”
Section: IRT Improves Cold Start Reliability (mentioning)
confidence: 99%
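The statement above refers to computerized adaptive testing (Weiss and Kingsbury, 1984), which repeatedly administers the item that is most informative at the current ability estimate. A minimal sketch under a two-parameter logistic (2PL) model; the item bank, parameter values, and function names are all hypothetical.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta,
    discrimination a, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information contributed by a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def next_item(theta_hat, items, asked):
    """Adaptive selection: pick the unasked item with maximum
    information at the current ability estimate."""
    best, best_info = None, -np.inf
    for idx, (a, b) in enumerate(items):
        if idx in asked:
            continue
        info = fisher_information(theta_hat, a, b)
        if info > best_info:
            best, best_info = idx, info
    return best

# Hypothetical item bank: (discrimination, difficulty) pairs
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.5)]
print(next_item(theta_hat=0.2, items=items, asked={0}))  # selects item 1
```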
“…Item response theory (IRT) has a similar formulation to BT, but also estimates the difficulty of each test instance using a latent-variable Bayesian model (Dras, 2015). IRT has been applied to perform dataset filtering (Lalor et al., 2016, 2019), evaluate chatbots from human assessments (Sedoc and Ungar, 2020), and aggregate human assessments in machine translation (Dras, 2015). Elo (Elo, 1978) and TrueSkill (Herbrich et al., 2007)…”
Section: Sources of Disagreement (mentioning)
confidence: 99%
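To make the contrast in the statement above concrete: unlike plain BT, IRT also assigns a difficulty to each test instance. A rough maximum-likelihood sketch of joint ability/difficulty estimation under a Rasch (1PL) model; the cited works use richer latent-variable Bayesian treatments, and all names and hyperparameters here are illustrative.

```python
import numpy as np

def fit_rasch(responses, n_systems, n_items, lr=0.1, epochs=500):
    """Jointly estimate system abilities (theta) and item difficulties (b)
    for a Rasch (1PL) model from binary graded responses.
    responses: list of (system_idx, item_idx, outcome in {0, 1}).
    Plain gradient ascent on the log-likelihood, not a Bayesian fit."""
    theta = np.zeros(n_systems)
    b = np.zeros(n_items)
    for _ in range(epochs):
        g_theta = np.zeros(n_systems)
        g_b = np.zeros(n_items)
        for s, i, y in responses:
            p = 1.0 / (1.0 + np.exp(-(theta[s] - b[i])))
            g_theta[s] += y - p   # d log-lik / d theta_s
            g_b[i] -= y - p       # d log-lik / d b_i
        theta += lr * g_theta
        b += lr * g_b
        theta -= theta.mean()     # pin down translation invariance
    return theta, b
```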
“…BT compares systems for each test instance and estimates the latent strength of systems based on how frequently one system scores higher than another. Such paired mechanisms have already been successfully used to aggregate human judgments (Novikova et al., 2018; Sedoc and Ungar, 2020); for example, WMT evaluation protocols regularly employ TrueSkill (Herbrich et al., 2007), a Bayesian variant of BT (Sakaguchi et al., 2014).…”
Section: Introduction (mentioning)
confidence: 99%
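A minimal sketch of the BT aggregation idea described above, using the classic minorization-maximization (Zermelo) update on a matrix of pairwise wins; the counts are made up, and this omits the Bayesian machinery that TrueSkill adds.

```python
import numpy as np

def bradley_terry(wins, iters=100):
    """Estimate latent Bradley-Terry strengths from a pairwise win matrix.
    wins[i, j] = number of times system i beat system j.
    Classic MM (Zermelo) iteration."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins of system i
            denom = 0.0
            for j in range(n):
                if j != i:
                    n_ij = wins[i, j] + wins[j, i]  # comparisons of i vs j
                    denom += n_ij / (p[i] + p[j])
            p[i] = num / denom
        p /= p.sum()  # normalize for identifiability
    return p

# Hypothetical human judgments over three chatbots
wins = np.array([[0, 6, 4],
                 [2, 0, 5],
                 [3, 1, 0]])
print(bradley_terry(wins))
```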
“…This allowed them to create a dynamic curriculum learning (Bengio et al., 2009) algorithm, which achieved superior performance to the same models trained using a static scheduler for several tasks. Sedoc and Ungar (2020) used IRT to efficiently assess chatbots. Martínez-Plumed et al. (2019) used IRT to analyze the performance of machine learning classifiers in a supervised learning task.…”
Section: Related Work (mentioning)
confidence: 99%