Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.535
Neural Automated Essay Scoring Incorporating Handcrafted Features

Abstract: Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by human raters. Conventional AES typically relies on handcrafted features, whereas recent studies have proposed AES models based on deep neural networks (DNNs) to obviate the need for feature engineering. Furthermore, hybrid methods that integrate handcrafted features in a DNN-AES model have been recently developed and have achieved state-of-the-art accuracy. One of the most popular hybrid method…

Cited by 48 publications (30 citation statements). References 33 publications.
“…Additionally, our model could be enhanced by multiple techniques found in other works, such as adding domain-specific features [47,51] or recognizing question types or their difficulty [34] and applying specialized methods for each, though this increases system complexity. It is especially interesting to leverage rubrics as defined by teachers.…”
Section: Discussion and Limitations
confidence: 99%
“…Most modern NLP systems have started to use attention-based transformer networks and large pretrained language models. Yang et al. (2020) and Uto et al. (2020) use the BERT base-uncased (Devlin et al., 2019) pre-trained language model to perform automatic essay grading, achieving QWKs in the range of 0.79 to 0.805. However, BERT has about 110 million parameters (compared to our largest model with just under 2 million parameters).…”
Section: Comparison With Transformer Models
confidence: 99%
“…Transformers are generally able to vastly outperform regression on engineered features. However, in some text labeling tasks, such as essay scoring, it has been shown that engineered features can be used in tandem with Transformer output to improve performance (Uto et al., 2020). This can be achieved simply by concatenating a vector of features f_n to BERT's CLS vector.…”
Section: Sentence-level Features
confidence: 99%
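The concatenation step quoted above can be sketched as follows. This is a minimal illustration, not the cited paper's exact pipeline: the 768-dimensional zero vector stands in for BERT base's [CLS] embedding, and `fuse_features` and the example feature values are hypothetical names chosen here.

```python
import numpy as np

def fuse_features(cls_vector, handcrafted):
    """Concatenate a handcrafted feature vector f_n onto the [CLS] embedding.

    The fused vector would then be fed to a downstream regression head
    in a hybrid Transformer-plus-features scoring model.
    """
    return np.concatenate([cls_vector, handcrafted])

cls_vec = np.zeros(768)                          # stand-in for BERT base's [CLS] output
feats = np.array([250.0, 12.0, 0.0, 3.0, 5.0])   # e.g. word count, sentence count, ...
fused = fuse_features(cls_vec, feats)            # 768 + 5 = 773 dimensions
```

In practice the handcrafted features are usually normalized (e.g. z-scored) before concatenation so their scale matches the contextual embedding.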
“…Our set of features was inspired by Uto et al. (2020), but we excluded the readability metrics because they are not as relevant for our task. Specifically, for text sample x_n, we calculate the number of words, number of sentences, number of exclamation marks, question marks, and commas, average word length, average sentence length, the number of nouns, verbs, adjectives, and adverbs, and the number of stop words.…”
Section: Sentence-level Features
confidence: 99%
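The surface-level features listed in that quote can be computed with the standard library alone, as in the sketch below. The tokenization regexes and the small stop-word set are illustrative assumptions; the part-of-speech counts (nouns, verbs, adjectives, adverbs) are omitted because they require a POS tagger such as spaCy or NLTK.

```python
import re

# Illustrative sample; real systems use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def essay_features(text):
    """Compute simple handcrafted features for one text sample x_n."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        "n_comma": text.count(","),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "n_stopwords": sum(w.lower() in STOP_WORDS for w in words),
    }

f = essay_features("The cat sat on the mat. It purred, loudly!")
```

Each feature maps directly to one entry of the vector f_n that gets concatenated to the Transformer output.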