“…Following standard reference-based evaluation practice, prior work has used n-gram overlap metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), which measure the overlap between generated distractors and the distractors in a set of human-annotated ground-truth sequences. However, reference-based distractor evaluation has notable shortcomings (Moon et al., 2022). In particular, for a given multiple-choice question, the set of annotated distractors is unlikely to span all possible good distractors, so some generated options may be unfairly penalised simply because no similar ones exist in the annotated set.…”
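To make the reference-based setup concrete, the following is a minimal sketch (assuming NLTK is available) of scoring a single generated distractor against the annotated distractor set with sentence-level BLEU; the question, distractor strings, and the helper name `bleu_against_references` are hypothetical and only illustrate the penalisation issue described above, not the exact protocol of any cited work.

```python
# Minimal sketch of reference-based distractor scoring with sentence-level BLEU.
# Assumes NLTK is installed; distractor strings below are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_against_references(generated: str, references: list[str]) -> float:
    """Score one generated distractor against all annotated distractors,
    treating the annotated set as multi-reference BLEU references."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short strings
    ref_tokens = [r.split() for r in references]
    return sentence_bleu(ref_tokens, generated.split(), smoothing_function=smooth)


# Hypothetical annotated distractors for some multiple-choice question.
annotated = ["the French Revolution", "the Industrial Revolution"]

# A distractor that happens to share n-grams with the annotated set scores well,
# while an equally plausible but lexically different one is unfairly penalised.
print(bleu_against_references("the Glorious Revolution", annotated))
print(bleu_against_references("the storming of the Bastille", annotated))
```

The second candidate may be a perfectly good distractor, but because the annotated set contains nothing lexically similar, its n-gram overlap score is low, which is exactly the shortcoming the excerpt attributes to reference-based evaluation.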