2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).
DOI: 10.1109/icassp.2003.1202765

Conditional pronunciation modeling in speaker detection

Abstract: In this paper, we present a conditional pronunciation modeling method for the speaker detection task that does not rely on acoustic vectors. Aiming at exploiting higher-level information carried by the speech signal, it uses time-aligned streams of phones and phonemes to model a speaker's specific pronunciation. Our system uses phonemes drawn from a lexicon of pronunciations of words recognized by an automatic speech recognition system to generate the phoneme stream and an open-loop phone recognizer to generate…
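
As a rough illustration of the approach the abstract describes, the sketch below estimates per-speaker conditional probability tables P(phone | phoneme) from time-aligned phoneme/phone pairs and scores a test alignment against a background model with a log-likelihood ratio. The function names, the add-alpha smoothing, and the toy symbols are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict
import math

def train_cpm(aligned_pairs, alphabet_size, alpha=1.0):
    """Estimate P(phone | phoneme) from time-aligned pairs, with add-alpha smoothing (an assumption)."""
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for phoneme, phone in aligned_pairs:
        counts[phoneme][phone] += 1.0
        totals[phoneme] += 1.0

    def prob(phoneme, phone):
        # Smoothed conditional probability of observing `phone` given `phoneme`.
        return (counts[phoneme][phone] + alpha) / (totals[phoneme] + alpha * alphabet_size)

    return prob

def score_llr(test_pairs, speaker_model, background_model):
    """Average log-likelihood ratio of a test alignment: speaker CPM vs. background CPM."""
    llr = sum(math.log(speaker_model(pm, ph)) - math.log(background_model(pm, ph))
              for pm, ph in test_pairs)
    return llr / max(len(test_pairs), 1)

# Toy usage with hypothetical phoneme/phone symbols.
background = train_cpm([("AH", "ah"), ("AH", "aa"), ("T", "t"), ("T", "dx")], alphabet_size=40)
speaker = train_cpm([("AH", "aa"), ("T", "dx"), ("T", "dx")], alphabet_size=40)
print(score_llr([("AH", "aa"), ("T", "dx")], speaker, background))
```

In a full system, per the abstract, the phoneme stream would come from a lexicon-based alignment of the ASR output and the phone stream from an open-loop phone recognizer.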

Cited by 26 publications (23 citation statements)
References 8 publications

“…The results show that there is significant benefit of fusing high- and low-level features for speaker verification. Among the high-level features investigated, the conditional pronunciation modeling (CPM) technique [30] that extracts multilingual phone sequences from utterances achieves the best performance [20]. One limitation of the CPM in [30] is that it requires multi-lingual corpora to build speaker and background models.…”
Section: Introduction
“…Among the high-level features investigated, the conditional pronunciation modeling (CPM) technique [30] that extracts multilingual phone sequences from utterances achieves the best performance [20]. One limitation of the CPM in [30] is that it requires multi-lingual corpora to build speaker and background models. To overcome this limitation, Leung et al [32] proposed using articulatory feature (AF) streams to construct CPM and called the resulting models AFCPM.…”
Section: Introduction
“…Text-independent speaker verification systems typically extract speaker features from short-term spectra of speech signals to build speaker-dependent Gaussian mixture models (GMMs) [1]. Studies have shown that combining low-level acoustic information with high-level speaker information, such as the usage or duration of particular words, prosodic features, and articulatory features (AF), can improve speaker verification performance [2][3][4][5][6].…”
Section: Introduction
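
The score-level combination of low- and high-level information mentioned in the statements above can be illustrated with a simple weighted fusion. The sketch below is an assumption-laden example (min-max normalization and a fixed weight), not the specific fusion recipe used in the cited work.

```python
def min_max_normalize(scores):
    """Map a list of trial scores to [0, 1]; a degenerate set maps everything to 0.5."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.5 for s in scores]

def fuse(acoustic_scores, highlevel_scores, w=0.7):
    """Weighted-sum fusion of per-trial scores after per-system normalization."""
    a = min_max_normalize(acoustic_scores)
    h = min_max_normalize(highlevel_scores)
    return [w * ai + (1 - w) * hi for ai, hi in zip(a, h)]

# Toy usage with hypothetical per-trial scores from an acoustic and a high-level subsystem.
print(fuse([-1.2, 0.4, 2.1], [0.1, 0.3, 0.9]))
```
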
“…This line of research, which is generally referred to as phonetic speaker recognition, was pioneered by Andrews et al, who used relative frequencies of phone n-grams to capture sequential patterns in an individual's speech [1,2]. This work was subsequently extended in various papers, such as the work of the "SuperSID" team at the JHU 2002 Summer Workshop [5,6,7]. In 2003, Campbell et al used support vector machines (SVMs) to train phonetic speaker models [3].…”
Section: Introduction