“…Following standard reference-based evaluation practice, prior work has used n-gram overlap metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), which measure the overlap between generated distractors and the distractors in a set of human-annotated ground-truth sequences. However, reference-based distractor evaluation has notable shortcomings (Moon et al., 2022). In particular, for a given multiple-choice question, the set of annotated distractors is unlikely to span all possible good distractors, so some generated options may be unfairly penalised simply because no similar ones exist in the annotated set.…”
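To make the reference-based setup concrete, the following is a minimal sketch (assuming NLTK is available) of scoring a single generated distractor against the annotated distractor set with sentence-level BLEU; the question, distractor strings, and the helper name `bleu_against_references` are hypothetical and only illustrate the penalisation issue described above, not the exact protocol of any cited work.

```python
# Minimal sketch of reference-based distractor scoring with sentence-level BLEU.
# Assumes NLTK is installed; distractor strings below are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_against_references(generated: str, references: list[str]) -> float:
    """Score one generated distractor against all annotated distractors,
    treating the annotated set as multi-reference BLEU references."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short strings
    ref_tokens = [r.split() for r in references]
    return sentence_bleu(ref_tokens, generated.split(), smoothing_function=smooth)


# Hypothetical annotated distractors for some multiple-choice question.
annotated = ["the French Revolution", "the Industrial Revolution"]

# A distractor that happens to share n-grams with the annotated set scores well,
# while an equally plausible but lexically different one is unfairly penalised.
print(bleu_against_references("the Glorious Revolution", annotated))
print(bleu_against_references("the storming of the Bastille", annotated))
```

The second candidate may be a perfectly good distractor, but because the annotated set contains nothing lexically similar, its n-gram overlap score is low, which is exactly the shortcoming the excerpt attributes to reference-based evaluation.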