Automatic text-independent pronunciation scoring of foreign language student speech

Han

IEICE Trans. Inf. & Syst.

2010

SUMMARY In this letter, we present useful features accounting for pronunciation prominence and propose a classification technique for prominence detection. A set of phone-specific features is extracted based on a forced alignment of the test pronunciation provided by a speech recognition system. These features are then applied to traditional classifiers such as the support vector machine (SVM), artificial neural network (ANN), and adaptive boosting (AdaBoost) to detect the place of prominence.

“…The test result is given in Table 3, where the performance is compared with that obtained when the exact transcript was provided. [Fig. 4: Comparison according to feature incorporations.]…”

Section: Results (mentioning)

confidence: 99%

Section: Duration (mentioning)

confidence: 99%

Study of Prominence Detection Based on Various Phone-Specific Features

Han

IEICE Trans. Inf. & Syst.

2010

“…The disadvantage of these methods is that they are text-dependent, so they only work for utterances with the same text as the native recordings, and cannot be used on other utterances. Neumeyer et al. presented a text-independent pronunciation assessment framework [11] in 1996, then improved the method by using posterior probabilities instead of the decoding log-likelihood [12], [13]. Witt et al. combined the advantages of these works and presented the GOP method.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Discriminative Method for Pronunciation Quality Assessment

Zhang, Pan, Dong et al.

2013

SUMMARY In this paper, we present a novel method for automatic pronunciation quality assessment. Unlike the popular "Goodness of Pronunciation" (GOP) method, this method does not map the decoding confidence into a pronunciation quality score, but directly differentiates utterances of different pronunciation quality. In this method, the student's utterance needs to be decoded twice. The first decoding obtains the time boundaries of each phone in the utterance by forced alignment using a conventionally trained acoustic model (AM). The second decoding differentiates the pronunciation quality of each triphone using a specially trained AM, in which triphones of different pronunciation quality are trained as different units, and the model is trained discriminatively to ensure the best discrimination among triphones that share a name but differ in pronunciation quality score. The decoding network in the second decoding includes triphones of different pronunciation quality, so the phone-level scores can be obtained directly from the decoding result. The phone-level scores are combined into sentence-level scores using the maximum entropy criterion. The experimental results show that the scoring performance improved significantly compared with the GOP method, especially at the sentence level.

“…The duration parameter is then normalised by considering the mean duration of the syllable nuclei in the utterance. This is a standard technique for Rate-Of-Speech (ROS) normalisation, described, for example, in Neumeyer (1996) and Venkata Ramana (2000).…”

Section: Duration (mentioning)

confidence: 99%
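
The duration normalisation mentioned in the quote above can be illustrated with a minimal sketch. The quote only says the duration is normalised "by considering the mean duration of the syllable nuclei in the utterance"; dividing each duration by that mean is an assumption made here for illustration, and the function name and sample values are hypothetical rather than taken from the cited papers.

```python
from statistics import mean


def normalise_nucleus_durations(durations_s):
    """Scale each syllable-nucleus duration by the utterance mean.

    Assumption: normalisation is a simple ratio to the mean, so a value
    above 1.0 marks a nucleus longer than average for this utterance.
    """
    if not durations_s:
        return []
    avg = mean(durations_s)
    if avg <= 0:
        raise ValueError("mean nucleus duration must be positive")
    return [d / avg for d in durations_s]


# Hypothetical nucleus durations (in seconds) for one utterance.
durations = [0.08, 0.12, 0.10, 0.10]
normalised = normalise_nucleus_durations(durations)
```

Under this assumption, a fast and a slow rendition of the same utterance yield similar normalised values, which is the point of compensating for rate of speech.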

An Automatic System for Detecting Prosodic Prominence in American English Continuous Speech

Tamburini, Caini

2005

Genet Resour Crop Evol

Abstract. A precise identification of prosodic phenomena and the construction of tools able to properly manage such phenomena are essential steps to disambiguate the meaning of certain utterances. In particular they are useful for a wide variety of tasks: automatic recognition of spontaneous speech, automatic enhancement of speech-generation systems, solving ambiguities in natural language interpretation, the construction of large annotated language resources, such as prosodically tagged speech corpora, and teaching languages to foreign students using Computer Aided Language Learning (CALL) systems. This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English, but with good prospects of application to other languages. Prosodic prominence involves two different prosodic features: pitch accent and stress accent. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable nuclei duration and mid-to-high-frequency emphasis. This paper shows that a careful measurement of these acoustic parameters, as well as the identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable with the inter-human agreement reported in the literature. Two different prominence detectors were studied and developed: the first uses a training corpus to set up thresholds properly, while the second uses a pure unsupervised method. In both cases, it is worth stressing that only acoustic parameters derived directly from speech waveforms are exploited.