Real-time speaker identification and verification

In this paper, we investigate imposture using synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both speaker verification (SV) and speech synthesis have renewed interest in it. We use an HMM-based speech synthesizer which creates synthetic speech for a targeted speaker through adaptation of a background model. We use two SV systems: a standard GMM-UBM-based system and a newer SVM-based system. Our results show that when the systems are tested with human speech, there are zero false acceptances and zero false rejections. However, when the systems are tested with synthesized speech, all claims for the targeted speaker are accepted while all other claims are rejected. We propose a two-step process for detection of synthesized speech in order to prevent this imposture. Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech will lead to an unacceptably high false acceptance rate.

“…The baseline EER is 8.0% for NIST 2002 corpus (100 speakers' test signals). These EERs closely agree with published values [10], [2].…”

Section: Speaker Verification System; citation type: supporting

confidence: 92%

Revisiting the security of speaker verification systems against imposture using synthetic speech

León, Apsingekar, Pucher et al. 2010

2010 IEEE International Conference on Acoustics, Speech and Signal Processing
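The excerpt above reports an equal error rate (EER), and the abstract speaks of false acceptances and false rejections. As a minimal sketch of how these quantities relate, assuming that a claim is accepted when its score is at or above a threshold, the following computes both error rates and an approximate EER by sweeping the observed scores. The function names and the toy scores are illustrative assumptions and are not taken from the cited papers.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when claims scoring >= threshold are accepted.

    genuine  : scores of true-speaker trials
    impostor : scores of impostor trials (human or synthetic)
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold.

    Returns (eer, threshold), where eer is the mean of FAR and FRR at the
    threshold that makes the two rates closest to each other.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.4]
    impostor_scores = [0.6, 0.3, 0.2, 0.1]
    eer, thr = equal_error_rate(genuine_scores, impostor_scores)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With well-separated genuine and impostor scores the EER falls to zero, which corresponds to the "zero false acceptances and zero false rejections" case in the abstract. Impostor trials that score like the target speaker raise the false acceptance rate at any fixed threshold.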

“…Generative models characterize the distribution of the feature vectors within the classes (speakers), whereas discriminative modeling focuses on modeling the decision boundary between the classes. For generative modeling, vector quantization (VQ) [8,28,32,45,74,80] and Gaussian mixture model (GMM) [67,68] are commonly used. For discriminative training, artificial neural networks (ANNs) [17,83] and, more recently, support vector machines (SVMs) [10,11] are representative techniques.…”

Section: Universal Background Model (UBM); citation type: mentioning

confidence: 99%

Comparison of clustering methods: A case study of text-independent speaker modeling

Kinnunen, Sidoroff, Tuononen et al. 2011

Pattern Recognition Letters

Clustering is needed in various applications such as biometric person authentication, speech coding and recognition, image compression and information retrieval. Hundreds of clustering methods have been proposed for the task in various fields but, surprisingly, there are few extensive studies actually comparing them. An important question is how much the choice of a clustering method matters for the final pattern recognition application. Our goal is to provide a thorough experimental comparison of clustering methods for text-independent speaker verification. We consider parametric Gaussian mixture model (GMM) and non-parametric vector quantization (VQ) model using the best known clustering algorithms including iterative (K-means, random swap, expectation-maximization), hierarchical (pairwise nearest neighbor, split, split-and-merge), evolutionary (genetic algorithm), neural (self-organizing map) and fuzzy (fuzzy C-means) approaches. We study recognition accuracy, processing time, clustering validity, and correlation of clustering quality and recognition accuracy. Experiments from these complementary observations indicate clustering is not a critical task in speaker recognition and the choice of the algorithm should be based on computational complexity and simplicity of the implementation. This is mainly for three reasons: the data is not clustered, large models are used and only the best algorithms are considered. For low-order models, choice of the algorithm, however, can have a significant effect.

“…Therefore research has been focusing on decreasing the computational load of identification while attempting to keep the recognition accuracy reasonably high. In a research concentrating on optimizing vector quantization (VQ) based speaker identification, the number of test vectors are reduced by pre-quantizing the test sequence prior to matching, and the number of speakers are reduced by pruning out unlikely speakers during the identification process (Kinnunen et al, 2006). The best variants are then generalized to Gaussian Mixture Model (GMM) based modeling.…”

Section: Speaker Recognition On Mobile Phone; citation type: mentioning

confidence: 99%
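The excerpt above names two speed-ups for VQ-based speaker identification: pre-quantizing the test sequence and pruning unlikely speakers during matching. The sketch below illustrates the general idea only. It assumes that pre-quantization is done by averaging runs of consecutive vectors and that pruning keeps a fixed fraction of the best-scoring speakers after each block of vectors. Both rules, and every name and parameter here (`prequantize`, `identify`, `group`, `block`, `keep_fraction`), are hypothetical choices for illustration, not the specific variants evaluated in the cited work.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prequantize(vectors, group):
    """Replace each run of `group` consecutive vectors with its mean.

    This shortens the test sequence before matching (assumed stand-in for
    whatever pre-quantizer a real system would use).
    """
    reduced = []
    for i in range(0, len(vectors), group):
        chunk = vectors[i:i + group]
        reduced.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return reduced


def identify(test_vectors, codebooks, group=2, block=2, keep_fraction=0.5):
    """Return the speaker whose codebook gives the lowest total distortion.

    codebooks : dict mapping speaker name -> list of code vectors
    After every `block` reduced vectors, only the best `keep_fraction`
    of the still-active speakers are kept (at least one survives).
    """
    reduced = prequantize(test_vectors, group)
    active = {name: 0.0 for name in codebooks}  # accumulated distortion
    for start in range(0, len(reduced), block):
        for vec in reduced[start:start + block]:
            for name in active:
                active[name] += min(sq_dist(vec, c) for c in codebooks[name])
        if len(active) > 1:
            keep = max(1, int(len(active) * keep_fraction))
            survivors = sorted(active, key=active.get)[:keep]
            active = {name: active[name] for name in survivors}
    return min(active, key=active.get)


if __name__ == "__main__":
    # Toy two-dimensional codebooks, invented for illustration only.
    books = {
        "alice": [[0.0, 0.0], [1.0, 1.0]],
        "bob": [[5.0, 5.0], [6.0, 6.0]],
        "carol": [[10.0, 10.0], [11.0, 11.0]],
    }
    test = [[5.1, 4.9], [5.9, 6.2], [5.0, 5.2], [6.1, 5.8]]
    print(identify(test, books))
```

Both steps trade accuracy for speed: a larger `group` leaves fewer vectors to match, and a smaller `keep_fraction` discards speakers sooner, at the risk of pruning the correct one early.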