Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems 2020
DOI: 10.18653/v1/2020.eval4nlp-1.3

Item Response Theory for Efficient Human Evaluation of Chatbots

Abstract: Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly wel…
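As background for the paired-comparison setup the abstract describes, here is a minimal sketch assuming a logistic (Bradley-Terry-style) link between latent system abilities; the paper's actual IRT parameterization, which also models per-item effects, is not reproduced here, and the ability values are made up.

```python
import numpy as np

def p_win(theta_a, theta_b):
    """Probability that system A's next-turn response is judged better
    than system B's, under a logistic link on latent abilities.
    An illustrative assumption, not the paper's exact model."""
    return 1.0 / (1.0 + np.exp(-(theta_a - theta_b)))

# Hypothetical abilities: a stronger system (0.5) vs. a weaker one (-0.2)
print(p_win(0.5, -0.2))  # ~0.668
```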

Cited by 12 publications (9 citation statements) | References 44 publications

“…In educational testing, collecting responses from humans is expensive; likewise, although questions are cheap in search-based QA tasks (Nguyen et al., 2016; Kwiatkowski et al., 2019), annotating answers is expensive. Similarly, "grading" machine dialog responses is expensive and IRT helps (Sedoc and Ungar, 2020). To emulate this setting, we use computerized adaptive testing (Weiss and Kingsbury, 1984) to iteratively select SQuAD items to "annotate.…”
Section: IRT Improves Cold Start Reliability (mentioning)
confidence: 99%
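The statement above refers to computerized adaptive testing (Weiss and Kingsbury, 1984), which repeatedly administers the item that is most informative at the current ability estimate. A minimal sketch under a two-parameter logistic (2PL) model; the item bank, parameter values, and function names are all hypothetical.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta,
    discrimination a, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information contributed by a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def next_item(theta_hat, items, asked):
    """Adaptive selection: pick the unasked item with maximum
    information at the current ability estimate."""
    best, best_info = None, -np.inf
    for idx, (a, b) in enumerate(items):
        if idx in asked:
            continue
        info = fisher_information(theta_hat, a, b)
        if info > best_info:
            best, best_info = idx, info
    return best

# Hypothetical item bank: (discrimination, difficulty) pairs
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.5)]
print(next_item(theta_hat=0.2, items=items, asked={0}))  # selects item 1
```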
“…Item response theory (IRT) has a similar formulation to BT, but also estimates the difficulty of each test instance using a latent-variable Bayesian model (Dras, 2015). IRT has been applied to perform dataset filtering (Lalor et al., 2016, 2019), evaluate chatbots from human assessments (Sedoc and Ungar, 2020), and aggregate human assessments in machine translation (Dras, 2015). Elo (Elo, 1978) and TrueSkill (Herbrich et al., 2007)…”
Section: Sources of Disagreement (mentioning)
confidence: 99%
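To make the contrast in the statement above concrete: unlike plain BT, IRT also assigns a difficulty to each test instance. A rough maximum-likelihood sketch of joint ability/difficulty estimation under a Rasch (1PL) model; the cited works use richer latent-variable Bayesian treatments, and all names and hyperparameters here are illustrative.

```python
import numpy as np

def fit_rasch(responses, n_systems, n_items, lr=0.1, epochs=500):
    """Jointly estimate system abilities (theta) and item difficulties (b)
    for a Rasch (1PL) model from binary graded responses.
    responses: list of (system_idx, item_idx, outcome in {0, 1}).
    Plain gradient ascent on the log-likelihood, not a Bayesian fit."""
    theta = np.zeros(n_systems)
    b = np.zeros(n_items)
    for _ in range(epochs):
        g_theta = np.zeros(n_systems)
        g_b = np.zeros(n_items)
        for s, i, y in responses:
            p = 1.0 / (1.0 + np.exp(-(theta[s] - b[i])))
            g_theta[s] += y - p   # d log-lik / d theta_s
            g_b[i] -= y - p       # d log-lik / d b_i
        theta += lr * g_theta
        b += lr * g_b
        theta -= theta.mean()     # pin down translation invariance
    return theta, b
```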
“…BT compares systems for each test instance and estimates the latent strength of systems based on how frequently one system scores higher than another. Such paired mechanisms have already been successfully used to aggregate human judgments (Novikova et al., 2018; Sedoc and Ungar, 2020); for example, WMT evaluation protocols regularly employ TrueSkill (Herbrich et al., 2007), a Bayesian variant of BT (Sakaguchi et al., 2014).…”
Section: Introduction (mentioning)
confidence: 99%
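A minimal sketch of the BT aggregation idea described above, using the classic minorization-maximization (Zermelo) update on a matrix of pairwise wins; the counts are made up, and this omits the Bayesian machinery that TrueSkill adds.

```python
import numpy as np

def bradley_terry(wins, iters=100):
    """Estimate latent Bradley-Terry strengths from a pairwise win matrix.
    wins[i, j] = number of times system i beat system j.
    Classic MM (Zermelo) iteration."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins of system i
            denom = 0.0
            for j in range(n):
                if j != i:
                    n_ij = wins[i, j] + wins[j, i]  # comparisons of i vs j
                    denom += n_ij / (p[i] + p[j])
            p[i] = num / denom
        p /= p.sum()  # normalize for identifiability
    return p

# Hypothetical human judgments over three chatbots
wins = np.array([[0, 6, 4],
                 [2, 0, 5],
                 [3, 1, 0]])
print(bradley_terry(wins))
```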
“…This allowed them to create a dynamic curriculum learning (Bengio et al., 2009) algorithm, which achieved superior performance to the same models trained using a static scheduler for several tasks. Sedoc and Ungar (2020) used IRT to efficiently assess chatbots. Martínez-Plumed et al. (2019) used IRT to analyze the performance of machine learning classifiers in a supervised learning task.…”
Section: Related Work (mentioning)
confidence: 99%