Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2022.emnlp-main.718

Evaluating the Knowledge Dependency of Questions

Abstract: The automatic generation of Multiple Choice Questions (MCQs) has the potential to significantly reduce the time educators spend on student assessment. However, existing evaluation metrics for MCQ generation, such as BLEU, ROUGE, and METEOR, focus on the n-gram-based similarity of the generated MCQ to the gold sample in the dataset and disregard its educational value. They fail to evaluate the MCQ's ability to assess the student's knowledge of the corresponding target fact. To tackle this issue, we propose a n…
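
For context, the reference-based scoring the abstract criticises can be reproduced in a few lines. The sketch below is a minimal illustration only (assuming NLTK is installed; both question strings are hypothetical): it rewards a generated question purely for n-gram overlap with the single gold question, which is exactly the signal that says nothing about whether the item tests the target fact.

```python
# Minimal sketch of reference-based n-gram evaluation (assumes NLTK).
# The gold and generated questions below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

gold_question = "What year did the French Revolution begin?".lower().split()
generated_question = "In which year did the French Revolution start?".lower().split()

# Smoothing avoids zero scores when higher-order n-grams do not overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu([gold_question], generated_question, smoothing_function=smooth)
print(f"BLEU against the gold sample: {score:.3f}")
```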

Cited by 2 publications (1 citation statement)
References: 24 publications

“…Following standard reference-based evaluation, n-gram overlap metrics such as BLEU (Papineni et al, 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) have been considered, where these metrics measure the overlap between generated distractors and the distractors from a set of human-annotated ground truth sequences. However, having reference-based distractor evaluation approaches has notable shortcomings (Moon et al, 2022). In particular, for a given multiple-choice question, the set of annotated distractors is unlikely to span the set of all possible good distractors, and some options may get unfairly penalised simply because no similar ones exist in the annotated set.…”
Section: Related Work
Confidence: 99%
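
To make the penalisation described in the citation statement concrete, here is a self-contained toy sketch (plain Python; the capital-of-Australia item and all distractor strings are hypothetical, not drawn from the cited papers). A paraphrase of an annotated distractor scores well under a ROUGE-1-style overlap, while an equally plausible distractor that happens to share no tokens with the annotated set scores zero.

```python
# Toy illustration of reference-based distractor scoring and its blind spot:
# a plausible new distractor with no lexical overlap is scored near zero.
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 between two whitespace-tokenised strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical item: "What is the capital of Australia?" (answer: Canberra).
annotated_distractors = ["Sydney", "the city of Sydney", "Melbourne"]

# "Perth" is just as plausible a distractor, but shares no tokens with the set.
for generated in ["Sydney city", "Perth"]:
    best = max(unigram_f1(generated, ref) for ref in annotated_distractors)
    print(f"{generated!r}: best overlap with annotated set = {best:.2f}")
```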