Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-3008
Atypical Inputs in Educational Applications

Abstract: In large-scale educational assessments, the use of automated scoring has recently become quite common. While the majority of student responses can be processed and scored without difficulty, there are a small number of responses that have atypical characteristics that make it difficult for an automated scoring system to assign a correct score. We describe a pipeline that detects and processes these kinds of responses at run-time. We present the most frequent kinds of what are called non-scorable responses alon…
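The abstract describes a run-time filter that flags atypical, non-scorable responses before they reach the scoring engine. As a rough illustration only (not the paper's actual system), a minimal sketch of such a filter might look like this; the checks and thresholds below are invented for demonstration:

```python
import re

def flag_non_scorable(response: str, min_words: int = 5):
    """Return a flag label if the response looks atypical, else None.

    Hypothetical heuristics, not the authors' pipeline:
    - too few words to score,
    - gibberish (few vowel-bearing tokens),
    - canned/repetitive text (one token dominates).
    """
    words = response.split()
    if len(words) < min_words:
        return "too_short"
    # Gibberish heuristic: low ratio of tokens containing a vowel.
    vowelish = sum(1 for w in words if re.search(r"[aeiouAEIOU]", w))
    if vowelish / len(words) < 0.5:
        return "possible_gibberish"
    # Canned-response heuristic: a single token makes up most of the text.
    if max(words.count(w) for w in set(words)) / len(words) > 0.5:
        return "repetitive"
    return None
```

Responses flagged by such a module would be routed to human scoring rather than receiving an unreliable automated score.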


Cited by 11 publications (7 citation statements). References 13 publications.
“…Features related to language use covered vocabulary, grammar and some aspects of discourse structure. An additional module was used to flag atypical responses where an automated score is likely to be unreliable [11,15]. See [12] for a detailed description of the features and the filtering module.…”
Section: Automated Scoring Engine
confidence: 99%
“…1) Several research studies have shown that essay scoring models are overstable (Yoon et al., 2018; Powers et al., 2002; Feng et al., 2018). Even large changes in essay content do not lead to significant changes in scores.…”
Section: Introduction
confidence: 99%
“…Motivated by the previous studies on testing automatic scoring systems [29, 20, 22], which show that AES models are vulnerable to atypical inputs, our aim is to gain some intuition into how models score a human-written sample. For instance, these studies show that automatic scoring systems score high on construct-irrelevant inputs like speeches and false facts [22], gibberish text [17], repeated paragraphs and canned responses [20], etc., but do not show why the models award high scores in these cases.…”
Section: Introduction
confidence: 99%