Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model

Brümmer, Niko; Silnova, Anna; Burget, Lukáš; Stafylakis, Themos

doi:10.21437/odyssey.2018-49

Cited by 29 publications

(32 citation statements)

References 21 publications

“…The latent identity variable framework [26] assumes that y is a pure representation of a person's identity and that there is a distribution on Y with known probability density function p(y). Given a likelihood function for the latent identity variable (e.g., meta-embedding [28]), one can make inferences about speaker identities within a set of speech utterances. Examples of such tasks include speaker verification, identification and clustering [29].…”

Section: Reinterpreting False Alarm Rate As Averaged Speaker-pair Con (mentioning)

confidence: 99%

Voice biometrics security: Extrapolating false alarm rate via hierarchical Bayesian modeling of speaker verification scores

Sholokhov, Kinnunen, Vestman, et al. 2020

Computer Speech & Language

How secure is automatic speaker verification (ASV) technology? More concretely, given a specific target speaker, how likely is it to find another person who gets falsely accepted as that target? This question may be addressed empirically by studying naturally confusable pairs of speakers within a large enough corpus. To this end, one might expect to find at least some speaker pairs that are indistinguishable from each other in terms of ASV. To a certain extent, this aim is mirrored in the standardized ASV evaluation benchmarks, for instance the series of speaker recognition evaluations (SRE) organized by the National Institute of Standards and Technology (NIST). Nonetheless, the number of speakers in such evaluation benchmarks arguably represents only a small fraction of all possible human voices, making it challenging to extrapolate performance beyond a given corpus. Furthermore, the impostors used in performance evaluation are usually selected randomly. A potentially more meaningful definition of an impostor, at least in the context of security-driven ASV applications, would be the closest (most confusable) other speaker to a given target. We put forward a novel performance assessment framework to address both the inadequacy of the random-impostor evaluation model and the size limitation of evaluation corpora by addressing ASV security against closest impostors on arbitrarily large datasets. The framework allows one to predict the safety of a given ASV technology, in its current state, for an arbitrarily large speaker database consisting of virtual (sampled) speakers. As a proof of concept, we analyze the performance of two state-of-the-art ASV systems, based on i-vector and x-vector speaker embeddings (as implemented in the popular Kaldi toolkit), on the recent VoxCeleb 1 & 2 corpora, containing a total of 7,365 speakers. We fix the number of target speakers to 1000 and generate up to N = 100,000 virtual impostors sampled from the generative model. The model-based false alarm rates are in reasonable agreement with empirical false alarm rates and, as predicted, increase substantially (values up to 98%) with N = 100,000 impostors. Neither the i-vector nor the x-vector system is immune to increased false alarm rate at increased impostor database size, as predicted by the model.
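
The growth of the false alarm rate with impostor pool size can be illustrated with a deliberately simplified calculation. The sketch below is not the hierarchical Bayesian model described in the abstract. It assumes, purely for illustration, that each of N impostors is falsely accepted independently with the same per-trial probability (the hypothetical parameter `p_single`), so the probability that at least one impostor, and hence the closest one, is accepted is 1 - (1 - p)^N. The function name and the example value of `p_single` are assumptions, not values from the paper.

```python
import math


def closest_impostor_far(p_single: float, n_impostors: int) -> float:
    """Probability that at least one of n independent impostors is accepted.

    Assumption (not from the paper): every impostor trial is independent and
    shares the same false alarm probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_impostors < 0:
        raise ValueError("n_impostors must be non-negative")
    if n_impostors == 0:
        return 0.0
    if p_single == 1.0:
        return 1.0
    # 1 - (1 - p)^N, computed via log1p/expm1 to keep precision
    # when p_single is tiny and n_impostors is large.
    return -math.expm1(n_impostors * math.log1p(-p_single))


# Hypothetical per-trial false alarm probability, chosen only for illustration.
for n in (1, 1_000, 100_000):
    print(n, closest_impostor_far(1e-5, n))
```

Even a very small per-trial false alarm probability compounds quickly as N grows, which is the qualitative effect the abstract reports; the independence assumption is what the paper's speaker-level modeling is meant to avoid.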

“…This means that we have to find approximations for both scoring and training. We make use of a new approximation, the Gaussian likelihood approximation, as recently published in [1]. In that paper, the approximation was used for both scoring and discriminative training.…”

Section: HT-PLDA Model (mentioning)

confidence: 99%

“…Both scoring and training recipes can be built around the likelihood for the hidden speaker identity variable, given the observation. Marginalization over the hidden variable, λij, gives a multivariate t-distribution for the observed vector [8,9,1]:…”

Section: The Gaussian Likelihood Approximation (mentioning)

confidence: 99%
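
The marginalization mentioned in the statement above can be checked numerically in one dimension. The sketch below assumes the textbook Gaussian scale mixture: a hidden precision scaling λ drawn from a Gamma distribution with shape ν/2 and rate ν/2, and an observation x that is Gaussian with mean 0 and variance 1/λ given λ. Integrating λ out should recover a Student's t density with ν degrees of freedom. This parameterization, the scalar setting, and all function names are assumptions made for illustration; the citing paper's multivariate model is not reproduced here.

```python
import math


def student_t_pdf(x: float, nu: float) -> float:
    """Closed-form density of a standard Student's t with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - 0.5 * (nu + 1) * math.log1p(x * x / nu))


def scale_mixture_pdf(x: float, nu: float,
                      upper: float = 60.0, steps: int = 200_000) -> float:
    """Integrate N(x | 0, 1/lam) * Gamma(lam | nu/2, rate nu/2) over lam.

    Uses a plain midpoint rule on (0, upper]; the integrand decays
    exponentially in lam, so the truncated tail is negligible.
    """
    a = b = nu / 2.0
    log_gamma_norm = a * math.log(b) - math.lgamma(a)
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        # log of the Gaussian density with precision lam
        log_gauss = 0.5 * math.log(lam / (2 * math.pi)) - 0.5 * lam * x * x
        # log of the Gamma(shape=a, rate=b) density at lam
        log_gamma = log_gamma_norm + (a - 1) * math.log(lam) - b * lam
        total += math.exp(log_gauss + log_gamma)
    return total * h


# Compare the numerically marginalized density with the closed form.
for x in (0.0, 1.0, 3.0):
    print(x, scale_mixture_pdf(x, nu=2.0), student_t_pdf(x, nu=2.0))
```

The two columns should agree to within the accuracy of the quadrature, which illustrates why the heavy tails of the observed vector come from the uncertainty in the hidden precision scaling rather than from the Gaussian itself.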

Fast Variational Bayes for Heavy-tailed PLDA Applied to i-vectors and x-vectors

et al. 2018

Self Cite

The standard state-of-the-art backend for text-independent speaker recognizers that use i-vectors or x-vectors is Gaussian PLDA (G-PLDA), assisted by a Gaussianization step involving length normalization. G-PLDA can be trained with either generative or discriminative methods. It has long been known that heavy-tailed PLDA (HT-PLDA), applied without length normalization, gives similar accuracy, but at considerable extra computational cost. We have recently introduced a fast scoring algorithm for a discriminatively trained HT-PLDA backend. This paper extends that work by introducing a fast, variational Bayes, generative training algorithm. We compare old and new backends, with and without length normalization, with i-vectors and x-vectors, on SRE'10, SRE'16 and SITW.

“…The recent work in this direction focuses on using the speaker embeddings that are scored with a probabilistic linear discriminant analysis (PLDA) based back-end [18,19]. This kind of systems give comparable or better results to that obtained with i-vector speaker modeling.…”

Section: Introduction (mentioning)

confidence: 99%

Generative X-Vectors for Text-Independent Speaker Verification

Das, Yılmaz, et al. 2018

2018 IEEE Spoken Language Technology Workshop (SLT)

Speaker verification (SV) systems using deep neural network embeddings, the so-called x-vector systems, are becoming popular due to their good performance, which is superior to that of i-vector systems. The fusion of these systems provides improved performance, benefiting from both the discriminatively trained x-vectors and the generative i-vectors, which capture distinct speaker characteristics. In this paper, we propose a novel method to include the complementary information of i-vectors and x-vectors, called the generative x-vector. The generative x-vector utilizes a transformation model learned from the i-vector and x-vector representations of the background data. Canonical correlation analysis is applied to derive this transformation model, which is later used to transform the standard x-vectors of the enrollment and test segments to the corresponding generative x-vectors. The SV experiments performed on the NIST SRE 2010 dataset demonstrate that the system using generative x-vectors provides considerably better performance than the baseline i-vector and x-vector systems. Furthermore, the generative x-vectors outperform the fusion of i-vector and x-vector systems for long-duration utterances, while yielding comparable results for short-duration utterances.
