2023
DOI: 10.1145/3556538
A Survey on Recent Approaches to Question Difficulty Estimation from Text

Abstract: Question Difficulty Estimation from Text (QDET) is the application of Natural Language Processing techniques to the estimation of a value, either numerical or categorical, which represents the difficulty of questions in educational settings. We give an introduction to the field, build a taxonomy based on question characteristics, and present the various approaches that have been proposed in recent years, outlining opportunities for further research. This survey provides an introduction for researchers and prac…

Cited by 13 publications (21 citation statements) · References 84 publications
“…Feature extraction method | Paper citations
- TF-IDF: Benedetto et al. (2020a, 2020b); Lin et al. (2015)
- Readability measures: Benedetto et al. (2020a); Choi & Moon (2020); Susanti et al. (2017); Yaneva et al. (2019, 2020)
- Corpus analysis software: Choi & Moon (2020); Pandarova et al. (2019); El Masri et al. (2017); Lee et al. (2019); Beinborn et al. (2014, 2015); Loukina et al. (2016); Sano (2015)
- Word embedding: Benedetto et al. (2021); Xu et al. (2022); Bi et al. (2021); Loginova et al. (2021); Susanti et al. (2020); Xue et al. (2020); Yaneva et al. (2019, 2020); Zhou & Tao (2020); Yeung et al. (2019); Cheng et al. (2019); Hsu et al. (2018); Huang et al. (2017)
- Ontology-based metrics: Kurdi et al. (2021); Vinu & Kumar (2015, 2017, 2020); Faizan & Lohmann (2018); Seyler et al. (2017); Vinu et al. (2016); Alsubait et al. (2016)
- LSTM/BiLSTM: Lin et al. (2019); Qiu et al. (2019); Cheng et al. (2019); Gao et al. (2018)

Syntax-level Feature Extraction: When investigating sources of difficulty in textual questions, textual co...…”
Section: Feature Extraction Methods
confidence: 99%
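The TF-IDF family listed above treats difficulty estimation as supervised regression over bag-of-words features. A minimal sketch of that pipeline, assuming scikit-learn is available; the toy questions and difficulty values are invented for illustration, not from any surveyed dataset:

```python
# Sketch of a TF-IDF-based difficulty regressor, in the spirit of the
# feature-extraction approaches tabulated above (e.g. TF-IDF + a linear model).
# All data below is fabricated purely to show the shape of the pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

questions = [
    "What is the capital of France?",
    "Explain the second law of thermodynamics.",
    "Derive the time complexity of merge sort.",
    "Name the largest planet in the solar system.",
]
difficulty = [0.2, 0.7, 0.8, 0.3]  # e.g. calibrated difficulty, rescaled to [0, 1]

vec = TfidfVectorizer(lowercase=True, stop_words="english")
X = vec.fit_transform(questions)           # sparse question-term matrix
model = Ridge(alpha=1.0).fit(X, difficulty)

# Predict a difficulty value for an unseen question.
new_q = ["Explain the complexity of quicksort."]
pred = model.predict(vec.transform(new_q))
print(round(float(pred[0]), 3))
```

Real systems in the survey pair such features with stronger regressors and evaluate against IRT-calibrated difficulties; this sketch only illustrates the feature-to-estimate mapping.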
“…Three types of baselines were found to be used for performance comparison: 1) comparison with an existing difficulty prediction model; 2) comparison with another feature extraction technique; or 3) comparison with one or more variants of the same model. Out of the 55 studies surveyed, only 8 papers compared their proposed model to an existing one (Benedetto et al., 2020a, 2020b, 2021; Qiu et al., 2019; Xu et al., 2022). This was mostly carried out using a different dataset and after making some modifications to the previous model.…”
Section: Evaluation Methods
confidence: 99%
“…In attempts to bypass field-testing, researchers have developed models to predict item difficulty from various item text features, including semantic and syntactic complexity, word and sentence lengths and counts, word embeddings, or readability indices (see AlKhuzaey et al., 2023; Benedetto et al., 2023). Some methods rely on expert judgement (Beinborn, Zesch, & Gurevych, 2014; Choi & Moon, 2020; Loukina et al., 2016; Settles, LaFlair, & Hagiwara, 2020), but these subjective approaches can suffer from poor inter-rater reliability (i.e., consistency between multiple judges) and replicability (AlKhuzaey et al., 2023; Conejo et al., 2020).…”
Section: Introduction
confidence: 99%
“…Some methods rely on expert judgement (Beinborn, Zesch, & Gurevych, 2014; Choi & Moon, 2020; Loukina et al., 2016; Settles, LaFlair, & Hagiwara, 2020), but these subjective approaches can suffer from poor inter-rater reliability (i.e., consistency between multiple judges) and replicability (AlKhuzaey et al., 2023; Conejo et al., 2020). Others rely on machine-driven natural language processing (NLP) techniques to predict item difficulty and/or discrimination (Benedetto et al., 2020a, 2020b, 2021; Yaneva et al., 2019; Zhou & Tao, 2020). However, their level of prediction accuracy is limited, and a simple estimation of item difficulty and discrimination does not capture the comprehensive nature of traditional field-testing.…”
Section: Introduction
confidence: 99%
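Readability indices, one of the text-feature families named above, reduce an item's text to a single interpretable score. A minimal sketch of the standard Flesch Reading Ease formula; the vowel-group syllable counter is a crude heuristic assumed here for self-containment, not a dictionary-based one:

```python
# Flesch Reading Ease as a single readability feature for an item.
# Score = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
# Syllables are approximated by counting runs of vowels (a crude heuristic).
import re

def count_syllables(word: str) -> int:
    # Count runs of vowels as syllables; every word gets at least one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

item = "Photosynthesis converts carbon dioxide and water into glucose."
print(round(flesch_reading_ease(item), 1))
```

Higher scores indicate easier text (short simple sentences can exceed 100; dense technical prose can go negative), which is why such indices appear as difficulty proxies in the feature-based approaches above.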