Speaker and Session Variability in GMM-Based Speaker Verification

Kenny

IEEE Trans. Audio Speech Lang. Process.

et al. 2011

Self Cite

3,299

2,556

Abstract: This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.

Index Terms: Cosine distance scoring, joint factor analysis (JFA), support vector machines (SVMs), total variability space.
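The second system in the abstract above uses the cosine similarity between two total-variability vectors as the decision score. A minimal sketch of that idea follows, assuming the vectors are already extracted as NumPy arrays. The function names, the `threshold` parameter, and the optional `projection` matrix (a stand-in for a pre-trained channel compensation transform such as LDA followed by WCCN) are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_score(w_target, w_test, projection=None):
    """Cosine similarity between two vectors, optionally after a linear projection.

    `projection` is an assumed, pre-trained matrix (for example a channel
    compensation transform); it is applied to both vectors before scoring.
    """
    w_target = np.asarray(w_target, dtype=float)
    w_test = np.asarray(w_test, dtype=float)
    if projection is not None:
        w_target = projection @ w_target
        w_test = projection @ w_test
    denom = np.linalg.norm(w_target) * np.linalg.norm(w_test)
    if denom == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return float(np.dot(w_target, w_test) / denom)


def accept(w_target, w_test, threshold, projection=None):
    """Accept the trial when the cosine score reaches the (assumed) threshold."""
    return cosine_score(w_target, w_test, projection) >= threshold
```

Because the score depends only on the angle between the two vectors, rescaling either vector leaves the decision unchanged. In practice the threshold would be tuned on development data.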

Section: A Databases (mentioning)

confidence: 99%

Front-End Factor Analysis for Speaker Verification

Kenny

IEEE Trans. Audio Speech Lang. Process.

et al. 2011

Self Cite

3,299

2,556

“…The preliminary experiments of [3,8] were reported on the NIST 2002 and 2006 SRE corpora using a lightweight Gaussian mixture model-universal background model (GMM-UBM) system [17] and generalized linear discriminant sequence support vector machine (GLDS-SVM) without any session variability compensation techniques. The recent results of [36], using multi-taper MFCC features only, were reported on NIST 2002 and 2008 SRE corpora using GMM-UBM, GMM-SVM and joint factor analysis (JFA) [38,39] classifiers.…”

Section: Introduction (mentioning)

confidence: 99%

Multitaper MFCC and PLP features for speaker verification using i-vectors

Alam

Kinnunen

Kenny

et al. 2013

Speech Communication

Self Cite

In this paper we study the performance of the low-variance multi-taper Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features in a state-of-the-art i-vector speaker verification system. The MFCC and PLP features are usually computed from a Hamming-windowed periodogram spectrum estimate. Such a single-tapered spectrum estimate has large variance, which can be reduced by averaging spectral estimates obtained using a set of different tapers, leading to a so-called multitaper spectral estimate. The multi-taper spectrum estimation method has proven to be powerful especially when the spectrum of interest has a large dynamic range or varies rapidly. Multi-taper MFCC features were also recently studied in speaker verification with promising preliminary results. In this study our primary goal is to validate those findings using an up-to-date i-vector classifier on the latest NIST 2010 SRE data. In addition, we also propose to compute robust perceptual linear prediction (PLP) features using multitapers. Furthermore, we provide a detailed comparison between different taper weight selections in the Thomson multi-taper method in the context of speaker verification. Speaker verification results on the telephone (det5) and microphone speech (det1, det2, det3 and det4) of the latest NIST 2010 SRE corpus indicate that the multitaper methods outperform the conventional periodogram technique. Instead of simply averaging (using uniform weights) the individual spectral estimates in forming the multitaper estimate, weighted averaging (using non-uniform weights) improves performance. Compared to the MFCC and PLP baseline systems, the sine-weighted cepstrum estimator (SWCE) based multitaper method provides average relative reductions of 12.3% and 7.5% in equal error rate, respectively. For the multi-peak multi-taper method, the corresponding reductions are 12.6% and 11.6%, respectively. Finally, the Thomson multitaper method provides error reductions of 9.5% and 5.0% in EER for MFCC and PLP features, respectively. We conclude that both the MFCC and PLP features computed via multitapers provide systematic improvements in recognition accuracy.
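The abstract above describes a multitaper estimate as a (possibly weighted) average of spectra computed with several different tapers. A rough sketch of that averaging step for one speech frame follows. It uses orthonormal sine tapers with uniform weights by default, which is an assumption made for illustration only: the paper's Thomson, multi-peak, and SWCE variants use their own tapers and weights, and those are not reproduced here. The function names and the `n_tapers` default are hypothetical.

```python
import numpy as np


def sine_tapers(n_samples, n_tapers):
    """Return an (n_tapers, n_samples) array of orthonormal sine tapers."""
    n = np.arange(1, n_samples + 1)
    k = np.arange(1, n_tapers + 1)[:, None]
    return np.sqrt(2.0 / (n_samples + 1)) * np.sin(np.pi * k * n / (n_samples + 1))


def multitaper_spectrum(frame, n_tapers=6, weights=None):
    """Weighted average of the power spectra obtained with each taper.

    With `weights=None` the tapered spectra are averaged uniformly;
    passing non-uniform weights gives a weighted multitaper estimate.
    """
    frame = np.asarray(frame, dtype=float)
    tapers = sine_tapers(frame.size, n_tapers)
    if weights is None:
        weights = np.full(n_tapers, 1.0 / n_tapers)
    weights = np.asarray(weights, dtype=float)
    # One power spectrum per taper: shape (n_tapers, n_samples // 2 + 1).
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ spectra
```

In a feature pipeline this estimate would replace the Hamming-windowed periodogram before the mel filterbank (MFCC) or PLP processing. Setting `n_tapers=1` reduces it to a single-taper estimate.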

“…MFCC coefficients are used for extracting features and minimum processing time in GMM is 10 ms for speech utterance. The parameters for GMM model is mean vectors, densities (is a sum of M numbers component density), and covariance matrices [3]. b. SVM -SVM is a discriminative speaker model which based on targeted speaker as well as imposter speaker.…”

Section: Speaker Models (mentioning)

confidence: 99%

“…Open set includes any number of registered speakers, and there is a possibility that unknown speaker also present, known as imposter. Imposter means the voice of the person is not belonging from the specific speaker [3]. Speaker verification is a task to check whether or not a voice token belongs to a specific speaker.…”

Section: Introduction (mentioning)

confidence: 99%

A Gaussian Mixture Model-Based Speaker Recognition System

Gorai

Abraham

2017

Asian J Pharm Clin Res

A human being has many unique features, and one of them is the voice. Speaker recognition is the use of a system to distinguish and identify a person from his/her vocal sound. A speaker recognition system (SRS) can be used as an authentication technique in addition to the conventional authentication methods. This paper presents an overview of voice signal characteristics and speaker recognition techniques. It also discusses the advantages and problems of current SRS. Because voice-based SRS is the only biometric system that allows users to authenticate remotely, a robust SRS is needed.