2017
DOI: 10.1109/taslp.2017.2692304

DNN-Driven Mixture of PLDA for Robust Speaker Verification

Abstract: The mismatch between enrollment and test utterances due to different types of variabilities is a great challenge in speaker verification. Based on the observation that the SNR-level variability or channel-type variability causes heterogeneous clusters in i-vector space, this paper proposes to apply supervised learning to drive or guide the learning of PLDA mixture models. Specifically, a deep neural network (DNN) is trained to produce the posterior probabilities of different SNR levels or channel types given i-…

Cited by 19 publications (8 citation statements)
References 37 publications
“…In [15], Mak et al advocated that utterances of different SNR levels will not only cause the i-vectors to fall on different regions of the i-vector spaces but also change the orientation of the speaker subspace. A mixture PLDA model with mixture alignments determined by the SNR level of utterances [15] or by their i-vectors [16] was then derived to model the SNR-dependent i-vectors.…”
Section: Introduction (mentioning)
confidence: 99%
“…The identity of a person can be verified in the biometric authentication systems using personal attributes, such as speech [16,17], face [18,19], fingerprints [20,21], palmprint [22,23], gait [24,25], and iris [26,27]. These physiological and behavioral attributes of humans are more reliable in authentication compared with knowledge-based or token-based approaches because these attributes cannot be stolen and are unique for every individual.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, Hasan et al [11] and Garcia-Romero et al [12] trained a PLDA model by pooling speeches from multiple conditions, and Li and Mak [13], [14] modeled the noise-level variability in utterances by introducing an SNR factor and an SNR subspace into the PLDA model. In [15], [16], Mak et al advocated that utterances of different SNR levels will not only cause i-vectors to fall on different regions of the i-vector spaces but also change the orientation of the speaker subspace. A mixture PLDA model with mixture alignments determined by the SNR level of utterances was then derived to model SNR-dependent i-vectors.…”
Section: Introduction (mentioning)
confidence: 99%