Mispronunciation Diagnosis of L2 English at Articulatory Level Using Articulatory Goodness-Of-Pronunciation Features

Ryu, Hyuksu; Chung, Minhwa

doi:10.21437/slate.2017-12

Cited by 11 publications

(4 citation statements)

References 16 publications

(31 reference statements)

“…Sudhakara et al [16] introduced context-aware GOP which takes both senone and transition state probabilities into consideration. Ryu et al [17] inferred that pronunciation scoring must combine phone level as well as articulatory-level diagnoses such as voicing, place of articulation, and manner of articulation on consonants. Lin et al [18] used the acoustic model and replaced the forced alignment layer with a self-attention layer to get an utterance score based on transfer learning, but the results greatly depend on fine-tuning the scores of datasets.…”

Section: Related Work (mentioning)

confidence: 99%

“…The former approaches discussed are based only on the pronunciation scoring based on the likelihood of individual phones in sequential order, hence is limited to phone-level features extraction of the audio file to test [13][14][15][16][17][18][19]. The latter checks for a minimum distance for comparing different length time series or MFCC or LPC for audio comparison [20], [21], [22].…”

Section: Related Work (mentioning)

confidence: 99%

Pronunciation Scoring With Goodness of Pronunciation and Dynamic Time Warping

Sheoran, Bajgoti, Gupta et al., 2023, IEEE Access

Current pronunciation scoring based on Goodness of Pronunciation (GOP) uses the posterior probabilities of the acoustic models. Such algorithms generalize poorly, since they determine a score for each phoneme rather than assessing completeness or comparing against the ideal utterance of the words. In this paper, a novel method is proposed that computes a combined score from prosodic, fluency, completeness, and accuracy scores. This is achieved using context-aware GOP in conjunction with dynamic time warping (DTW) matching of the pitch contours of a weighted average of the context tokens found in the audio file that is rich in mispronounced phonemes. The proposed work gives flexibility in tuning the results according to different speech aspects based on a single hyperparameter. The results achieved are encouraging and have been validated on the speechocean762 dataset, where the Automatic Speech Recognition (ASR) model has been trained on the Librispeech dataset. The resultant mean error of the proposed approach is 3.38% and the value of the correlation coefficient achieved is 0.652.
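As a rough illustration of the DTW matching of pitch contours mentioned in this abstract, the following is a minimal sketch of the classic dynamic-programming DTW cost between two one-dimensional sequences. It is not the authors' implementation: the function name `dtw_distance`, the absolute-difference local cost, and the toy contours are all assumptions made for illustration.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two 1-D sequences.

    Assumption: the local cost is the absolute difference between samples.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical pitch contours in Hz; the learner contour is a
# time-stretched copy of the reference contour.
reference = [120.0, 130.0, 140.0, 130.0]
learner = [120.0, 120.0, 130.0, 140.0, 140.0, 130.0]
print(dtw_distance(reference, learner))
```

Because DTW allows one sample to align with several samples of the other sequence, a contour that differs from the reference only in timing accumulates no cost, which is why it suits comparing utterances of different lengths.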

“…Another application of phones recognition and articulatory features estimation is Computer-Assisted Pronunciation Training. Some of the approaches are described in [8] and [9]. This paper dwells on applications of attention-based models to articulatory features detection.…”

Section: Introduction (mentioning)

confidence: 99%

“…Another application of phones recognition and articulatory features estimation is Computer-Assisted Pronunciation Training. Some of the approaches are described in [8] and [9].…”

Section: Introduction (mentioning)

confidence: 99%

Attention Model for Articulatory Features Detection

Karaulov, Tkanov, 2019, Interspeech 2019

Articulatory distinctive features, as well as phonetic transcription, play an important role in speech-related tasks: computer-assisted pronunciation training, text-to-speech conversion (TTS), studying speech production mechanisms, and speech recognition for low-resourced languages. End-to-end approaches to speech-related tasks have gained a lot of traction in recent years. We apply the Listen, Attend and Spell (LAS) [1] architecture to phone recognition on a small training set, like TIMIT [2]. We also introduce a novel decoding technique that makes it possible to train detectors of manner and place of articulation end-to-end using attention models. Finally, we explore joint phone recognition and articulatory feature detection in a multitask learning setting.
