Context-dependent modeling for acoustic-phonetic recognition of continuous speech

Schwartz, Richard; Chow, Yen‐Lu; Kimball, Owen; Roucos, S.; Krasner, M.; Makhoul, John

doi:10.1109/icassp.1985.1168283

Cited by 172 publications

(78 citation statements)

References 6 publications


“…These augmented triphones, called "PIC"s, are the fundamental unit of the system, and are closely related to other approaches that have appeared in the literature ( [16] and [14]). The information that the PICs currently contain is the identity of the preceding and succeeding phonemes, and, optionally, an estimate of the degree of the phoneme's prepausal lengthening.…”

Section: Report Documentation Page (mentioning)

confidence: 99%

Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems

Baker, Baker, Bamberg et al. 1992


In this paper we present some of the algorithm improvements that have been made to Dragon's continuous speech recognition and training programs, improvements that have more than halved our error rate on the Resource Management task since the last SLS meeting in February 1991. We also report the "dry run" results that we have obtained on the 5000-word speaker-dependent Wall Street Journal recognition task, and outline our overall research strategy and plans for the future. In our system, a set of output distributions, known as the set of PELs (phonetic elements), is associated with each phoneme. The HMM for a PIC (phoneme-in-context) is represented as a linear sequence of states, each having an output distribution chosen from the set of PELs for the given phoneme, and a (double exponential) duration distribution. In this paper we report on two methods of acoustic modeling and training. The first method involves generating a set of (unimodal) PELs for a given speaker by clustering the hypothetical frames found in the spectral models for that speaker, and then constructing speaker-dependent PEL sequences to represent each PIC. The "spectral model" for a PIC is simply the expected value of the sequence of frames that would be generated by the PIC. The second method represents the probability distribution for each parameter in a PEL as a mixture of a fixed set of unimodal components, the mixing weights being estimated using the EM algorithm. In both models we assume that the parameters are statistically independent. We report results obtained using each of these two methods (RePELing/Respelling and univariate "tied mixtures") on the 5000-word closed-vocabulary verbalized punctuation version of the Wall Street Journal task.
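The second method in the abstract above (mixing weights over a fixed set of unimodal components, estimated with EM) can be illustrated with a minimal sketch. This is not Dragon's implementation: the Gaussian component shape, the function names `gaussian_pdf` and `em_mixture_weights`, the uniform initialization, and the iteration count are all assumptions made for illustration. Only the weights are re-estimated; the component means and variances stay fixed.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian (assumed component shape)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_mixture_weights(samples, components, n_iter=50):
    """Estimate mixing weights for a FIXED list of (mean, variance) components.

    Hypothetical helper: the components themselves are never updated.
    """
    k = len(components)
    weights = [1.0 / k] * k  # assumed uniform starting point
    for _ in range(n_iter):
        totals = [0.0] * k
        for x in samples:
            # E-step: posterior probability of each component for this sample.
            joint = [w * gaussian_pdf(x, m, v)
                     for w, (m, v) in zip(weights, components)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # sample is numerically unreachable by every component
            for j in range(k):
                totals[j] += joint[j] / norm
        total = sum(totals)
        if total == 0.0:
            return weights
        # M-step: new weight is the average posterior of the component.
        weights = [t / total for t in totals]
    return weights
```

With two well-separated components, for example `em_mixture_weights([0.1, -0.2, 0.05, 9.9], [(0.0, 1.0), (10.0, 1.0)])`, the weights settle close to the share of samples that fall near each component.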



“…Given that in every step of the iteration the likelihood evaluated at the current Lo and Qo (1:T) is increasing (or at least equal) to the likelihood evaluated at the previous Lo and Qo (1:T), a maximum of (6) is eventually reached, which corresponds to reach a local maximum of an approximation to (1).…”

Section: Model Optimization (mentioning)

confidence: 99%

Recognition of intervocalic stops in continuous speech using context-dependent HMMs

Franco 1990

J. Acoust. Soc. Jpn. (E)


In this work the design and evaluation of the recognition performance of context-dependent Hidden Markov Models (HMMs) for the intervocalic voiced and unvoiced stops is described. The phoneme HMMs are context-dependent in order to account for coarticulatory effects. Continuous probability density functions are used for the output vectors. Initial model parameter estimates are obtained by means of an automatic segmentation procedure for careful modeling of relevant phonetic features. The model structure and the training scheme are directed to associate the most acoustically discriminative segments of the consonants with a sequence of states in every consonant model. The speech data base consisted of a total of 2,592 productions of the Spanish Stops /p, t, k, b, d, g/ in intervocalic positions with the vowels /a, i, u/ embedded in VCVCVCV nonsense utterances. The speech data has been produced by two male Argentine Spanish speakers. Phoneme recognition is accomplished finding the state sequence with highest likelihood in an ergodic model formed by the linking of all the context-dependent phoneme models allowing only the phonotactically valid state transitions. A comparative study of the recognition performance under different degrees of context dependence, and the alternative use of spectral dynamic and energy related parameters is presented.
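The decoding step described in the abstract above (finding the highest-likelihood state sequence while allowing only phonotactically valid transitions) is commonly done with the Viterbi algorithm. The sketch below is a generic log-domain Viterbi, not the paper's code: the function names `log_or_neg_inf` and `viterbi` and the list-of-lists layout are assumptions, and a forbidden transition is modeled simply as probability zero.

```python
import math

NEG_INF = float("-inf")


def log_or_neg_inf(p):
    """Log of a probability; zero (a forbidden transition) maps to -inf."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(log_obs, log_trans, log_init):
    """Return the most likely state path.

    log_obs[t][s]   : log-likelihood of frame t in state s
    log_trans[i][j] : log transition probability i -> j (-inf if invalid)
    log_init[s]     : log initial probability of state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_obs[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Setting an entry of `log_trans` to `-inf` removes that transition from every candidate path, which is one simple way to encode a validity constraint on the linked models.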


“…(Cross-word triphones, which are a feature of the old TS decoder, will be implemented later.) These models are smoothed with reduced context phone models [20]. Each phone model is a three state "linear" (no skip transitions) HMM.…”

Section: The Basic HMM System (mentioning)

confidence: 99%

“…The system uses Gaussian tied mixture [4,6] observation pdfs and treats each observation stream as if it is statistically independent of all others. Triphone models [20] are used to model phonetic coarticulation. (Cross-word triphones, which are a feature of the old TS decoder, will be implemented later.)…”

Section: The Basic HMM System (mentioning)

confidence: 99%

The Lincoln large-vocabulary HMM CSR

Paul 1992

Proceedings of the Workshop on Speech and Natural Language - HLT '91


The work described here focuses on recognition of the Wall Street Journal (WSJ) pilot database [17], a new CSR database which supports 5K, 20K, and up to 64K-word CSR tasks. The original Lincoln Tied-Mixture HMM CSR was implemented using a time-synchronous beam-pruned search of a static network [14] and does not extend well to this task because the recognition network would be too large for currently practical workstations. Therefore, the recognizer has been converted to a stack decoder-based search strategy [1,7,16]. This decoder has been shown to function effectively on up to 64K-word recognition of continuous speech. This paper describes the acoustic modeling techniques and the implementation of the stack decoder used to obtain these results.
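The citation statements for this paper mention word-internal triphone models that are smoothed with reduced-context phone models. A minimal sketch of both ideas follows. It is illustrative only: the names `to_triphones` and `smooth`, the use of `None` as a word-boundary marker, and the fixed interpolation weights are assumptions, and the excerpts do not say how the actual smoothing weights are chosen.

```python
def to_triphones(phones):
    """Expand a word's phone list into (left, phone, right) contexts.

    None marks a word boundary, so no cross-word context is used
    (an assumption matching the word-internal case only).
    """
    padded = [None] + list(phones) + [None]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def smooth(distributions, weights):
    """Linearly interpolate equal-length discrete distributions.

    Typical inputs would be the triphone estimate followed by
    reduced-context estimates (hypothetical ordering).
    """
    if len(distributions) != len(weights):
        raise ValueError("one weight per distribution is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    size = len(distributions[0])
    return [sum(w * d[k] for w, d in zip(weights, distributions))
            for k in range(size)]
```

For example, `to_triphones(["k", "ae", "t"])` yields one context triple per phone, and `smooth([triphone_dist, monophone_dist], [0.5, 0.5])` pulls a sparsely trained triphone estimate toward the better-trained monophone one (both distribution names here are placeholders).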

