1989
DOI: 10.1109/29.46546

Speaker-independent phone recognition using hidden Markov models

Abstract: In this paper, we extend hidden Markov modeling to speaker-independent phone recognition. Using multiple codebooks of various LPC parameters and discrete HMMs, we obtain a speaker-independent phone recognition accuracy of 58.8% to 73.8% on the TIMIT database, depending on the type of acoustic and language models used. In comparison, the performance of expert spectrogram readers is only 69% without use of higher level knowledge. We also introduce the co-occurrence smoothing algorithm which enables accurate recog…

Cited by 771 publications (409 citation statements)
References 16 publications
“…Phoneme set -The phoneme set consists of 39 phonemes. It is very similar to the CMU/MIT phoneme set [2], but closures were merged with burst instead of with silence (bcl b → b). We believe it is more appropriate for features which use a longer temporal context such as TRAPs.…”
Section: Methods (mentioning)
confidence: 77%
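The closure-to-burst merge described in this statement (bcl b → b) can be sketched as follows. This is only an illustration and not the citing authors' code: the mapping table uses TIMIT-style stop closure labels, the function name `merge_closures` is hypothetical, and relabeling a closure that has no following burst as the burst symbol is an assumption made here for simplicity.

```python
# Illustrative closure-to-burst pairs with TIMIT-style labels (an assumption,
# not taken from the cited work).
CLOSURE_TO_BURST = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}


def merge_closures(labels):
    """Collapse each stop closure and its following burst into one burst symbol."""
    merged = []
    i = 0
    while i < len(labels):
        label = labels[i]
        burst = CLOSURE_TO_BURST.get(label)
        if burst is None:
            # Not a closure: keep the label unchanged.
            merged.append(label)
        else:
            # Closure followed by its own burst: consume both labels.
            if i + 1 < len(labels) and labels[i + 1] == burst:
                i += 1
            # Assumption: a closure without a matching burst is still
            # relabeled as the burst symbol.
            merged.append(burst)
        i += 1
    return merged
```

For example, `merge_closures(["bcl", "b", "iy"])` yields a two-symbol sequence in which the closure and burst share the single label `b`.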
“…We used the TIMIT corpus with a sampling rate of 16 kHz. The "SA" sentences have not been used to avoid an unfair bias for certain phonemes [21]. In order to simulate mismatching training and testing conditions with respect to the mean VTL, the training and testing data was split into male and female subsets and three scenarios were defined: 1) Training on both male and female data and testing on male and female data (FM-FM), 2) training on male data and testing on female data (M-F) and 3) training on female data and testing on male data (F-M).…”
Section: A Experimental Setup (mentioning)
confidence: 99%
“…In the train-test partitioning of the data we followed the widely accepted standard: the full set of the 3696 train sentences were used for training, and testing was always executed on the full test dataset of 1344 utterances. The phonetic labels of the database were fused into 39 categories, which is again standard practice [11].…”
Section: Experiments With Clean Speech (mentioning)
confidence: 99%