Mispronunciation Detection in Non-Native (L2) English with Uncertainty Modeling

Korzekwa, Daniel; Lorenzo-Trueba, Jaime; Zaporowski, Szymon; Calamaro, Shira; Drugman, Thomas; Kostek, Bożena

doi:10.1109/icassp39728.2021.9413953

Cited by 10 publications

(12 citation statements)

References 16 publications

(24 reference statements)


“…The effectiveness of these techniques is assessed in two tasks: detecting mispronounced words (replacing, adding, removing phonemes, or pronouncing an unknown speech sound) and detecting lexical stress errors. The results presented in this study are the culmination of our recent work on speech generation in pronunciation error detection task [11,22,23], including a new S2S technique.…”

Section: Introduction (mentioning)

confidence: 90%

“…For example, the word 'enough' can be pronounced by native speakers in multiple ways: /ih n ah f/ or /ax n ah f/ (short 'i' or 'schwa' phoneme at the beginning). In our previous work, we solve these problems by creating a native speech pronunciation model that returns the probability of the sentence to be spoken by a native speaker [11].…”

Section: Phoneme Recognition Approaches (mentioning)

confidence: 99%

“…Techniques based on phoneme recognition can be supplemented by a reference speech signal obtained from the speech database [37][38][39] or generated from the phonetic representation [11,40]. Xiao et al [37] use a pair of speech signals from a student and a native speaker to classify native and non-native speech.…”

Section: Phoneme Recognition Approaches (mentioning)

confidence: 99%

“…It has been shown to help people practice and improve their pronunciation skills [6][7][8]. CAPT consists of two components: an automated pronunciation evaluation component [9][10][11] and a feedback component [12]. The automated pronunciation evaluation component is responsible for detecting pronunciation errors in spoken speech, for example, for detecting words pronounced incorrectly by the speaker.…”

Section: Introduction (mentioning)

confidence: 99%

“…Researchers have given most attention to studying various machine learning models such as Bayesian networks [14,15] and deep learning methods [9,10], as well as analyzing different representations of the speech signal such as prosodic features (duration, energy and pitch) [16], and cepstral/spectral features [9,13,17]. Despite significant progress in recent years, existing CAPT methods detect pronunciation errors with relatively low accuracy of 60% precision at 40%-80% recall [9][10][11]. Highlighting correctly pronounced words as pronunciation errors by the CAPT tool can demotivate students and lower the confidence in the tool.…”

Section: Introduction (mentioning)

confidence: 99%


Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

Korzekwa, Lorenzo-Trueba, Drugman et al. 2022

Preprint

Self Cite


The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in AUC metric by 41% from 0.528 to 0.749 compared to the state-of-the-art approach.
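The 41% figure in the abstract is a relative gain, not an absolute one. As a minimal sketch of the arithmetic (the function name `relative_improvement` is ours for illustration and does not come from the paper), the two AUC values quoted above give:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative change of `improved` over `baseline` as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# AUC values quoted in the abstract
baseline_auc = 0.528
s2s_auc = 0.749

absolute_gain = s2s_auc - baseline_auc
relative_gain = relative_improvement(baseline_auc, s2s_auc)

print(f"absolute gain: {absolute_gain:.3f} AUC")
print(f"relative gain: {relative_gain:.1%}")
```

The absolute gain is 0.221 AUC, and the relative gain comes to just under 42%, which the abstract reports as 41%.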

