The INTERSPEECH 2016 Computational Paralinguistics Challenge addresses three different problems for the first time in a research competition under well-defined conditions: classification of deceptive vs. non-deceptive speech, estimation of the degree of sincerity, and identification of the native language out of eleven L1 classes of English L2 speakers. In this paper, we describe these sub-challenges, their conditions, the baseline feature extraction and classifiers, and the resulting baselines, as provided to the participants.
This research report provides an overview of the R&D efforts at Educational Testing Service related to its capability for automated scoring of nonnative spontaneous speech with the SpeechRater℠ automated scoring service since its initial version was deployed in 2006. While most aspects of this R&D work have been published in various venues in recent years, no comprehensive account of the current state of SpeechRater has been provided since the initial publications following its first operational use in 2006. After a brief review of recent related work by other institutions, we summarize the main features and feature classes that have been developed and introduced into SpeechRater in the past 10 years, including features measuring aspects of pronunciation, prosody, vocabulary, grammar, content, and discourse. Furthermore, new types of filtering models for flagging nonscorable spoken responses are described, as is our new hybrid way of building linear regression scoring models with improved feature selection. Finally, empirical results for SpeechRater 5.0 (operationally deployed in 2016) are provided.
This study describes an approach for modeling the discourse coherence of spontaneous spoken responses in the context of automated assessment of non-native speech. Although the measurement of discourse coherence is typically a key metric in human scoring rubrics for assessments of spontaneous spoken language, little prior research has been done to assess a speaker's coherence in the context of automated speech scoring. To address this, we first present a corpus of spoken responses drawn from an assessment of English proficiency that has been annotated for discourse coherence. When adding these discourse annotations as features to an automated speech scoring system, the accuracy in predicting human proficiency scores is improved by 7.8% relative, thus demonstrating the effectiveness of including coherence information in the task of automated scoring of spontaneous speech. We further investigate the use of two different sets of features to automatically model the coherence quality of spontaneous speech, including a set of features originally designed to measure text complexity and a set of surface-based features describing the speaker's use of nouns, pronouns, conjunctions, and discourse connectives in the spoken response. Additional experiments demonstrate that an automated speech scoring system can benefit from coherence scores that are generated automatically using these feature sets.
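The surface-based feature set described above can be illustrated with a minimal sketch: normalized counts of pronouns, conjunctions, and discourse connectives in a transcript. The word lists and feature names here are illustrative assumptions, not the paper's actual feature definitions.

```python
import re

# Hypothetical word classes for surface-based coherence features.
# The actual lists used in the study are not reproduced here.
PRONOUNS = {"he", "she", "it", "they", "this", "that", "these", "those"}
CONJUNCTIONS = {"and", "but", "or", "so", "because", "although"}
CONNECTIVES = {"however", "therefore", "moreover", "furthermore",
               "first", "second", "finally"}

def coherence_features(transcript: str) -> dict:
    """Return counts of coherence-related word classes, normalized by length."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    n = max(len(tokens), 1)  # guard against empty transcripts
    return {
        "pronoun_ratio": sum(t in PRONOUNS for t in tokens) / n,
        "conjunction_ratio": sum(t in CONJUNCTIONS for t in tokens) / n,
        "connective_ratio": sum(t in CONNECTIVES for t in tokens) / n,
    }
```

In a scoring pipeline, such ratios would be appended to the existing feature vector before model training.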
Test takers in high-stakes speaking assessments may try to inflate their scores by providing a response to a question that they are more familiar with instead of the question presented in the test; such a response is referred to as an off-topic spoken response. The presence of these responses can make it difficult to accurately evaluate a test taker's speaking proficiency, and thus may reduce the validity of assessment scores. This study aims to address this problem by building an automatic system to detect off-topic spoken responses, which can inform the downstream automated scoring pipeline. We propose an innovative method to interpret the comparison between a test response and the question used to elicit it as a similarity grid, and then apply very deep convolutional neural networks to determine different degrees of topic relevance. In this study, Inception networks were applied to this task, and the experimental results demonstrate the effectiveness of the proposed method. Our system achieves an F1-score of 92.8% on the class of off-topic responses, which significantly outperforms a baseline system using a range of word embedding-based similarity metrics (F1-score = 85.5%).
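The similarity-grid construction described above can be sketched as follows: each cell (i, j) holds the cosine similarity between the i-th response word vector and the j-th question word vector, and the resulting 2-D grid is what the convolutional network would consume. The random embedding vectors below are an assumption for illustration only; the paper's actual embeddings and Inception architecture are not reproduced.

```python
import numpy as np

def similarity_grid(response_vecs: np.ndarray, question_vecs: np.ndarray) -> np.ndarray:
    """Return a (len(response), len(question)) grid of cosine similarities."""
    r = response_vecs / np.linalg.norm(response_vecs, axis=1, keepdims=True)
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    return r @ q.T  # each entry is a cosine similarity in [-1, 1]

# Toy data: 12 response tokens and 8 question tokens, 50-dim vectors.
rng = np.random.default_rng(0)
resp = rng.normal(size=(12, 50))
ques = rng.normal(size=(8, 50))
grid = similarity_grid(resp, ques)
```

Treated as a single-channel image, this grid can then be fed to any image-classification CNN to predict the degree of topic relevance.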
This paper describes an end-to-end prototype system for automated scoring of spoken responses in a novel assessment for teachers of English as a Foreign Language who are not native speakers of English. The 21 speaking items contained in the assessment elicit both restricted and moderately restricted responses, and their aim is to assess the essential speaking skills that English teachers need in order to be effective communicators in their classrooms. Our system consists of a state-of-the-art automatic speech recognizer; multiple feature generation modules addressing diverse aspects of speaking proficiency, such as fluency, pronunciation, prosody, grammatical accuracy, and content accuracy; a filter that identifies and flags problematic responses; and linear regression models that predict response scores based on subsets of the features. The automated speech scoring system was trained and evaluated on a data set involving about 1,400 test takers, and achieved a speaker-level correlation (when scores for all 21 responses of a speaker are aggregated) with human expert scores of 0.73.
In this paper we introduce a subjective metric for evaluating the performance of spoken dialog systems, Caller Experience (CE). CE is a useful metric for tracking the overall performance of a system in deployment, as well as for isolating individual problematic calls in which the system underperforms. The proposed CE metric differs from most performance evaluation metrics proposed in the past in that it is a) a subjective, qualitative rating of the call, and b) provided by expert, external listeners, not the callers themselves. The results of an experiment in which a set of human experts listened to the same calls three times are presented. The fact that these results show a high level of agreement among different listeners, despite the subjective nature of the task, demonstrates the validity of using CE as a standard metric. Finally, an automated rating system using objective measures is shown to perform at the same high level as the humans. This is an important advance, since it provides a way to reduce the human labor costs associated with producing a reliable CE.
This report describes the initial automated scoring results that were obtained using the constructed responses from the Writing and Speaking sections of the pilot forms of the TOEFL Junior® Comprehensive test administered in late 2011. For all of the items except one (the edit item in the Writing section), existing automated scoring capabilities were used with only minor modifications to obtain a baseline benchmark for automated scoring performance on the TOEFL Junior task types; for the edit item in the Writing section, a new automated scoring capability based on string matching was developed. A generic scoring model from the e-rater® automated essay scoring engine was used to score the email, opinion, and listen-write items in the Writing section, and the form-level results based on the five responses in the Writing section from each test taker showed a human–machine correlation of r = .83 (compared to a human–human correlation of r = .90). For scoring the Speaking section, new automated speech recognition models were first trained, and then item-specific scoring models were built for the read-aloud, picture narration, and listen-speak items using preexisting features from the SpeechRater℠ automated speech scoring engine (with the addition of a new content feature for the listen-speak items). The form-level results based on the five items in the Speaking section from each test taker showed a human–machine correlation of r = .81 (compared to a human–human correlation of r = .89).
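A string-matching scorer of the kind described for the edit item can be sketched with the standard library: the response is compared against an answer key and credited when the match is close enough. The key, threshold, and scoring rule below are illustrative assumptions, not the capability actually developed for TOEFL Junior.

```python
from difflib import SequenceMatcher

def score_edit_item(response: str, key: str, threshold: float = 0.9) -> int:
    """Award 1 point if the response closely matches the answer key.

    Matching is case-insensitive and tolerant of small typos via
    difflib's similarity ratio; the 0.9 threshold is a made-up example.
    """
    ratio = SequenceMatcher(
        None, response.strip().lower(), key.strip().lower()
    ).ratio()
    return 1 if ratio >= threshold else 0
```

Exact-match scoring would be the threshold=1.0 special case; a fuzzy threshold trades strictness for robustness to minor transcription noise.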