SRI International’s EduSpeak® system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to specific production problems. We review our approach to pronunciation scoring, where our aim is to estimate the grade that a human expert would assign to the pronunciation quality of a paragraph or a phrase. Using databases of nonnative speech and corresponding human ratings at the sentence level, we evaluate different machine scores that can be used as predictor variables to estimate pronunciation quality. For more specific feedback on pronunciation, the EduSpeak toolkit supports phone-level mispronunciation detection, which automatically flags specific phone segments that have been mispronounced. Phone-level information makes it possible to give the student feedback about specific pronunciation mistakes. Two approaches to mispronunciation detection were evaluated on a phonetically transcribed database of 130,000 phones uttered in continuous-speech sentences by 206 nonnative speakers. Results show that the classification error of the best system, for the phones that can be reliably transcribed, is only slightly higher than the average pairwise disagreement between the human transcribers.
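As a hedged illustration of the scoring idea above, the sketch below fits a linear predictor that maps sentence-level machine scores to human pronunciation ratings and reports the human-machine correlation. It is not the EduSpeak implementation: the three predictor variables, the synthetic data, and the least-squares model are all assumptions chosen for illustration.

```python
"""Minimal sketch (not the EduSpeak implementation): combining machine
scores as predictor variables to estimate human pronunciation ratings.
The three predictors and all data below are hypothetical stand-ins."""
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
N = 500  # number of rated nonnative utterances (made up)

# Hypothetical per-sentence machine scores.
posterior_score = rng.normal(-2.0, 0.5, N)   # mean log-posterior w.r.t. native models
duration_score  = rng.normal(0.0, 1.0, N)    # log-likelihood of phone durations
rate_of_speech  = rng.normal(4.0, 0.8, N)    # phones per second

# Hypothetical human ratings on a 1-5 scale, correlated with the predictors.
human_rating = np.clip(
    3.0 + 0.8 * (posterior_score + 2.0) + 0.3 * duration_score
    + rng.normal(0.0, 0.5, N), 1.0, 5.0)

# Ordinary least squares: estimate the grade a human rater would assign
# as a linear combination of the machine scores (plus a bias term).
X = np.column_stack([posterior_score, duration_score, rate_of_speech, np.ones(N)])
w, *_ = lstsq(X, human_rating, rcond=None)
predicted = X @ w

# Evaluate with the human-machine correlation, the usual figure of merit.
r = np.corrcoef(predicted, human_rating)[0, 1]
print(f"weights: {w.round(3)}, human-machine correlation r = {r:.2f}")
```

Any monotone regression model could replace the least-squares fit here; the sketch only shows how machine scores act as predictor variables for a human grade.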
This work investigates whether nonlexical information from speech can automatically predict the quality of small-group collaborations. Audio was collected from students as they collaborated in groups of three to solve math problems. Experts in education annotated 30-second time windows by hand for collaboration quality. Speech activity features (computed at the group level) and spectral, temporal, and prosodic features (extracted at the speaker level) were explored. After the latter were transformed from the speaker level to the group level, the features were fused. Results using support vector machines and random forests show that feature fusion yields the best classification performance. The corresponding unweighted average F1 measure on a 4-class prediction task ranges between 40% and 50%, significantly higher than chance (12%). Speech activity features alone are strong predictors of collaboration quality, achieving an F1 measure between 35% and 43%. Speaker-based acoustic features alone achieve lower classification performance, but offer value in fusion. These findings illustrate that the approach under study offers promise for future monitoring of group dynamics, and should be attractive for many collaborative settings in which privacy is desired.
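A minimal sketch of the fusion pipeline described above follows. The feature dimensions and data are random stand-ins, and mean/std pooling is an assumed speaker-to-group transformation; the paper's exact features and transformation are not reproduced here.

```python
"""Sketch of the fusion setup under assumed shapes: per-window group-level
speech-activity features are concatenated with speaker-level acoustic
features pooled (mean/std) to the group level; all data are synthetic."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_windows, n_speakers = 600, 3

# Hypothetical features per 30-second window.
activity = rng.normal(size=(n_windows, 8))               # group-level speech activity
acoustic = rng.normal(size=(n_windows, n_speakers, 20))  # per-speaker spectral/prosodic

# Transform speaker-level features to the group level by pooling statistics
# across the three speakers, then fuse by concatenation.
pooled = np.concatenate([acoustic.mean(axis=1), acoustic.std(axis=1)], axis=1)
fused = np.concatenate([activity, pooled], axis=1)
labels = rng.integers(0, 4, n_windows)  # 4 collaboration-quality classes (made up)

X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, random_state=0)
for name, clf in [("SVM", make_pipeline(StandardScaler(), SVC())),
                  ("RF", RandomForestClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    # Unweighted (macro) average F1 over the four classes, as in the paper.
    print(name, round(f1_score(y_te, clf.predict(X_te), average="macro"), 2))
```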
Conventional speaker recognition systems identify speakers using spectral information from very short slices of speech. Such systems perform well (especially in quiet conditions), but fail to capture idiosyncratic longer-term patterns in a speaker's habitual speaking style, including duration and pausing patterns, intonation contours, and the use of particular phrases. We investigate the contribution of modeling such prosodic and lexical patterns to performance on the NIST 2003 Speaker Recognition Evaluation extended data task. We report results for (1) systems based on individual feature types alone, (2) each such system in combination with a state-of-the-art frame-based baseline system, and (3) an all-system combination. Our results show that certain longer-term stylistic features provide powerful complementary information both to frame-level cepstral features and to each other. Stylistic features thus significantly improve speaker recognition performance over conventional systems, and offer promise for a variety of intelligence and security applications.
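As one hypothetical reading of system combination, the sketch below fuses per-trial scores from a cepstral baseline and two stylistic systems (prosodic, lexical) with logistic regression. All scores are synthetic and logistic-regression fusion is an assumption for illustration, not necessarily the combiner used in the evaluation.

```python
"""Illustrative score-level fusion sketch: a strong cepstral baseline plus
two weaker but complementary stylistic systems; all scores are synthetic."""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, n)  # 1 = same-speaker trial, 0 = impostor

# Hypothetical per-system scores: the baseline separates best; the stylistic
# systems are weaker but carry independent (complementary) information.
def system_scores(separation):
    return target * separation + rng.normal(0.0, 1.0, n)

cepstral = system_scores(2.0)
prosodic = system_scores(1.0)
lexical  = system_scores(0.8)

# Fuse the three score streams; in practice the fusion weights would be
# trained on held-out trials rather than on the evaluation data.
stacked = np.column_stack([cepstral, prosodic, lexical])
fused = LogisticRegression().fit(stacked, target).decision_function(stacked)

for name, s in [("cepstral alone", cepstral), ("all-system fusion", fused)]:
    print(f"{name}: AUC = {roc_auc_score(target, s):.3f}")
```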
In the context of computer-aided language learning, automatic detection of specific phone mispronunciations by nonnative speakers can be used to provide detailed feedback about specific pronunciation problems. In previous work we found that significant improvements could be achieved, compared to standard approaches that compute posteriors with respect to native models, by explicitly modeling both mispronunciations and correct pronunciations by nonnative speakers. In this work, we extend our approach with model adaptation and discriminative modeling techniques, inspired by methods that have been effective in the area of speaker identification. Two systems were developed: one based on Bayesian adaptation of Gaussian Mixture Models (GMMs) and likelihood-ratio-based detection, and another based on Support Vector Machine (SVM) classification of supervectors derived from adapted GMMs. Both systems, and their combination, were evaluated on a phonetically transcribed Spanish database of 130,000 phones uttered in continuous-speech sentences by 206 nonnative speakers, showing significant improvements over our previous best system.
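The sketch below illustrates, under stated assumptions, the two detector designs named above: MAP (Bayesian) adaptation of GMM means followed by a likelihood-ratio decision, and an SVM over supervectors of adapted means. The 2-D synthetic "frames", component count, and relevance factor are placeholders, not the paper's configuration.

```python
"""Rough sketch of the two detectors, with synthetic 2-D frames standing
in for acoustic features; all settings and data are illustrative."""
import numpy as np
from copy import deepcopy
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def map_adapt_means(ubm, frames, relevance=16.0):
    """Bayesian (MAP) adaptation of the UBM means to new frames."""
    gamma = ubm.predict_proba(frames)              # frame-component posteriors
    n_k = gamma.sum(axis=0)                        # per-component soft counts
    e_k = (gamma.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]     # data-dependent mixing weight
    adapted = deepcopy(ubm)
    adapted.means_ = alpha * e_k + (1 - alpha) * ubm.means_
    return adapted

# Background model trained on pooled frames; class models adapted from it.
ubm = GaussianMixture(8, covariance_type="diag", random_state=0).fit(
    rng.normal(size=(2000, 2)))
correct_gmm = map_adapt_means(ubm, rng.normal(0.0, 1.0, (500, 2)))
mispron_gmm = map_adapt_means(ubm, rng.normal(1.5, 1.0, (500, 2)))

# System 1: likelihood-ratio detection. Flag a phone segment as
# mispronounced when the average log-likelihood ratio exceeds a threshold.
def llr(frames):
    return mispron_gmm.score(frames) - correct_gmm.score(frames)

segment = rng.normal(1.5, 1.0, (40, 2))            # a hypothetical phone segment
print("LLR:", round(llr(segment), 2))

# System 2: SVM over supervectors. Each segment is MAP-adapted from the
# UBM and represented by its stacked adapted means.
def supervector(frames):
    return map_adapt_means(ubm, frames).means_.ravel()

segs = [rng.normal(m, 1.0, (40, 2)) for m in [0.0, 0.0, 1.5, 1.5]]
svm = SVC(kernel="linear").fit([supervector(s) for s in segs], [0, 0, 1, 1])
print("SVM:", svm.predict([supervector(segment)]))
```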