Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications 2015
DOI: 10.3115/v1/w15-0625

Evaluating the performance of Automated Text Scoring systems

Abstract: Various measures have been used to evaluate the effectiveness of automated text scoring (ATS) systems with respect to a human gold standard. However, there is no systematic study comparing the efficacy of these metrics under different experimental conditions. In this paper we first argue that measures of agreement are more appropriate than measures of association (i.e., correlation) for measuring the effectiveness of ATS systems. We then present a thorough review and analysis of frequently used measures of agreement…
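The abstract's central distinction can be made concrete with a small sketch (mine, not the paper's; the score vectors are hypothetical): a system whose predictions are offset from the gold scores by a constant achieves perfect association but no exact agreement.

```python
# Hedged illustration (not from the paper): perfect correlation, no agreement.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

gold = np.array([1, 2, 3, 4, 5])   # hypothetical human scores
pred = gold + 1                    # system is one point too high on every essay

print(pearsonr(gold, pred)[0])        # 1.0: perfect association
print(cohen_kappa_score(gold, pred))  # about -0.19: worse-than-chance agreement
```

A correlation-based evaluation would rate this system as perfect even though it never matches a human score exactly, which is the kind of failure the agreement-vs-association argument targets.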

Cited by 33 publications (26 citation statements) · References 29 publications (26 reference statements)
“…All models are trained on our training set (see Section 4), except the one prefixed 'word2vec pre-trained', which uses pre-trained embeddings on the Google News Corpus. We report Spearman's rank correlation coefficient ρ, Pearson's product-moment correlation coefficient r, and the root mean square error (RMSE) between the predicted scores and the gold standard on our test set, which are considered more appropriate metrics for evaluating essay scoring systems (Yannakoudakis and Cummins, 2015). However, we also report Cohen's κ with quadratic weights, which was the evaluation metric used in the Kaggle competition.…”
Section: Results (citation type: mentioning; confidence: 99%)
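As a minimal sketch of how the four metrics named in this excerpt are typically computed (the score arrays are hypothetical, and rounding real-valued predictions before computing κ is my assumption, not a detail from the citing paper):

```python
# Sketch: the evaluation metrics from the excerpt above, via SciPy/scikit-learn.
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import cohen_kappa_score, mean_squared_error

gold = np.array([2, 4, 3, 5, 1, 4])              # hypothetical gold-standard scores
pred = np.array([2.2, 3.1, 3.0, 4.8, 1.9, 4.1])  # hypothetical system outputs

rho, _ = spearmanr(gold, pred)                   # Spearman's rank correlation
r, _ = pearsonr(gold, pred)                      # Pearson's correlation
rmse = np.sqrt(mean_squared_error(gold, pred))   # root mean square error

# Cohen's kappa with quadratic weights needs discrete labels,
# so real-valued predictions are rounded first (an assumption).
qwk = cohen_kappa_score(gold, np.rint(pred).astype(int), weights="quadratic")

print(f"rho={rho:.3f}  r={r:.3f}  RMSE={rmse:.3f}  QWK={qwk:.3f}")
```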
“…In this paper, we predict real-valued scores on a continuous scale and evaluate the accuracy of the predicted scores by using mean squared error (MSE) as our default metric. Although some previous studies have used quadratically-weighted kappa (QWK) as another possible metric for evaluating content-scoring models, more recent work has shown that QWK may possess properties that render it less than suitable for automated scoring evaluation (Yannakoudakis and Cummins, 2015).…”
Section: Methods (citation type: mentioning; confidence: 99%)
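For readers comparing the two metrics contrasted in this excerpt, here is a from-scratch sketch of each; the function names and the assumption of integer labels in {0, …, n−1} are mine:

```python
import numpy as np

def mse(gold, pred):
    """Mean squared error over real-valued scores (the excerpt's default metric)."""
    gold, pred = np.asarray(gold, float), np.asarray(pred, float)
    return float(np.mean((gold - pred) ** 2))

def quadratic_weighted_kappa(gold, pred, n_labels):
    """QWK from first principles; assumes integer labels in {0, ..., n_labels-1}."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    observed = np.zeros((n_labels, n_labels))    # label co-occurrence counts
    for g, p in zip(gold, pred):
        observed[g, p] += 1
    idx = np.arange(n_labels)                    # quadratic disagreement weights
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_labels - 1) ** 2
    # expected co-occurrence under chance: outer product of the two marginals
    expected = np.outer(np.bincount(gold, minlength=n_labels),
                        np.bincount(pred, minlength=n_labels)) / len(gold)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

gold = [0, 2, 1, 3, 2, 3]   # hypothetical integer scores
pred = [0, 2, 2, 3, 1, 2]
print(mse(gold, pred), quadratic_weighted_kappa(gold, pred, n_labels=4))
```

Unlike MSE, QWK depends on the marginal label distributions through its expected-agreement term, which is one source of the sensitivity the excerpt alludes to.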
“…Being able to detect topical relevance can help prevent such weaknesses and provide useful feedback to the students, and is also a step towards evaluating more creative aspects of learner writing. While there is existing work on detecting answer relevance given a textual prompt (Persing and Ng, 2014; Cummins et al., 2015; Rei and Cummins, 2016), only limited previous research has extended this to visual prompts. Some recent work has investigated answer relevance to visual prompts as part of automated scoring systems (Somasundaran et al., 2015; King and Dickinson, 2016), but they reduced the problem to a textual similarity task by relying on hand-written reference descriptions for each image, without directly incorporating visual information.…”
Section: Relevance Detection Model (citation type: mentioning; confidence: 99%)
“…While there is previous work on assessing the relevance of answers given a textual prompt (Persing and Ng, 2014; Cummins et al., 2015; Rei and Cummins, 2016), very little research has been done to incorporate visual writing prompts. In this setting, students are asked to write a short description of an image in order to assess their language skills, and we would like to automatically evaluate the semantic relevance of their answers.…”
Section: Introduction (citation type: mentioning; confidence: 99%)