In this paper, we propose using the characteristics of a speaker's vowel space for automated assessment of second language (L2) proficiency. Specifically, we adopt features that previous studies have shown to be good indicators of native speaker intelligibility and clarity, and apply them to L2 speech from non-native speakers. The features focus on three peripheral vowels (IY, AA, and OW) and measure a speaker's coverage of the vowel space. We conducted a pilot study and a large-scale corpus study of read speech produced by native and non-native speakers, in which the vowel space features were rank-correlated with pronunciation scores provided by human listeners for the non-native speech and with an assumed higher score for the native speech. The results show that several of the features achieve moderately high correlations with the pronunciation scores, supporting their usefulness for automated assessment of non-native speech. The best-performing feature in the large-scale study was the F2 − F1 distance for IY, which achieved a correlation of 0.78 with pronunciation proficiency scores.
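To make the evaluation idea concrete, the following is a minimal sketch of rank-correlating an F2 − F1 feature with pronunciation scores. It assumes one mean (F1, F2) measurement in Hz per speaker for the vowel IY and uses a plain Spearman rank correlation with average ranks for ties. The function names and the toy values are hypothetical illustrations, not the paper's data or its exact procedure.

```python
def f2_minus_f1(f1_hz, f2_hz):
    """Vowel-space feature: distance between the second and first formant."""
    return f2_hz - f1_hz


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values, for illustration only: (F1, F2) in Hz for IY,
# and one pronunciation score per speaker.
iy_formants = [(420, 1900), (380, 2100), (340, 2250), (300, 2400)]
scores = [1, 2, 2, 4]

features = [f2_minus_f1(f1, f2) for f1, f2 in iy_formants]
rho = spearman(features, scores)
```

A rank correlation is a natural fit here because the human scores are ordinal, so only the ordering of speakers matters, not the spacing between score levels.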
We propose a novel approach that integrates exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose context-dependent phone segments (triphone context) as the template unit and use multiple Gaussian mixture model (GMM) indices to represent each frame of the speech templates. We investigate two local distances, log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. To reduce computation and storage complexity, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine-tune the MLTS template representatives with a GMM merging algorithm so that the GMMs better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and on a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning demonstrated that integrating template matching with statistical modeling significantly improved recognition accuracy over the hidden Markov modeling (HMM) baselines on both tasks. The template selection methods also provided significant accuracy gains over the HMM baseline while substantially reducing computation and storage complexity. When all templates or MDTS were used, the LLR local distance performed better than the KL local distance. For MLTS and template compression, the KL local distance performed better than the LLR local distance, and template compression further improved recognition accuracy on top of MLTS at a lower computational cost.
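As a rough illustration of DTW-based template matching with a KL-style local distance, the sketch below aligns two sequences of frames with a textbook DTW recursion. It assumes each frame is represented as a discrete probability distribution and uses a symmetric KL divergence between frames. Both choices, along with the names `sym_kl` and `dtw`, are simplifying assumptions for illustration and do not reproduce the paper's GMM-index representation or its exact distance definitions.

```python
import math


def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), for discrete distributions."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = max(a, eps), max(b, eps)  # floor to avoid log(0)
        total += (a - b) * math.log(a / b)
    return total


def dtw(template, test, local_dist):
    """Accumulated cost of the best DTW alignment between two frame sequences."""
    n, m = len(template), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_dist(template[i - 1], test[j - 1])
            # Allowed moves: advance the template, the test, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical frames: each is a distribution over two classes.
template = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
stretched = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
different = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]

same_cost = dtw(template, stretched, sym_kl)   # time-stretched copy of the template
other_cost = dtw(template, different, sym_kl)  # reversed frame order
```

Because DTW may repeat frames, the time-stretched copy aligns to the template at no extra cost, while the reversed sequence accumulates a positive distance. A recognizer built on this idea would compare a test segment against many stored templates, which is why template selection matters for computation and storage.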