On autoencoders in the i-vector space for speaker recognition

Pekhovsky, Timur; Novoselov, Sergey; Sholohov, Aleksei; Kudashev, Oleg

doi:10.21437/odyssey.2016-31

Cited by 37 publications

(19 citation statements)

References 10 publications

“…In most settings, DNNs are used as a replacement for Gaussian mixture models (GMMs) to improve the conventional i-vector approach [1] by having a more phonetically aware Universal Background Model (UBM) [2,3,4]. Other subsequent methods based on DNNs were introduced for noise-robust and domain-invariant i-vectors [5,6,7]. However, the process of training the GMM-UBM and extracting i-vectors largely remained the same.…”

Section: Introduction

mentioning

confidence: 99%

Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model

Shon

Tang

Glass

2018

2018 IEEE Spoken Language Technology Workshop (SLT)

In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and has the potential for other analyses to improve speaker recognition.
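The abstract above compares frame-level embeddings by cosine distance. As a minimal sketch of that measure only (the function name `cosine_distance` and the toy vectors are illustrative assumptions, not taken from the cited paper), it can be computed as one minus the cosine of the angle between two vectors:

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); hypothetical helper for illustration."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy embeddings (made up): scaling a vector does not change the distance.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```

Because the measure ignores vector length, two embeddings that point in the same direction have distance 0 regardless of their norms, and orthogonal embeddings have distance 1.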

“…That prevented us from using any stand-alone VAD. The reader can refer to [7] for more DNN implementation details. 20 MFCCs (including log energy) were calculated using 23 filter banks in the range of 20-3700 Hz with their first- and second-order derivatives.…”

Section: DNN-Based I-vector System

mentioning

confidence: 99%
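The front-end quoted above appends first- and second-order derivatives to 20 static coefficients, which yields 60 values per frame. A minimal sketch of one common way to compute such derivatives follows, assuming the standard regression formula with a two-frame context on each side and edge frames repeated; the quoted statement does not specify the window or the edge handling, and the names `deltas` and `add_derivatives` are hypothetical.

```python
def deltas(frames, window=2):
    """Regression-based time derivatives over +/- `window` frames (assumed window)."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]  # repeat the last frame at the end
                prv = frames[max(t - k, 0)][d]      # repeat the first frame at the start
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_derivatives(frames, window=2):
    """Concatenate static, first-order and second-order features per frame."""
    d1 = deltas(frames, window)
    d2 = deltas(d1, window)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]


# Toy input (made up): 10 frames of 20 coefficients, each rising by 1 per frame.
static = [[float(t)] * 20 for t in range(10)]
features = add_derivatives(static)
```

With the toy ramp input, each frame in `features` has 60 values; away from the edges the first-order block equals the slope of the ramp and the second-order block is zero.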

“…Aside from the standard PLDA we studied the application of a denoising autoencoder (DAE) based back-end [6,7] to SITW data "in the wild" conditions.…”

Section: DAE-based Back-end

mentioning

confidence: 99%

“…[…] can be viewed as the maximum likelihood estimate in the following model of within-speaker variability: […], where […] is the Gaussian distribution with mean […] and covariance […]. Then we "unfold" the trained RBM to form the neural network, which we refer to as a denoising autoencoder (DAE) [7] (Figure 1, right). The DAE is discriminatively trained (fine-tuned) to minimize within-speaker variability, defined in the following way:…”

Section: DAE-based Back-end

mentioning

confidence: 99%

“…The influence of using artificially noised training data for minimization of the mismatch between train and evaluation conditions is studied. In addition to conventional PLDA, a novel back-end based on DAE-PLDA scheme [6,7] is investigated.…”

mentioning

confidence: 99%

A Speaker Recognition System for the SITW Challenge

et al. 2016

Self Cite

This paper presents an ITMO University system submitted to the Speakers in the Wild (SITW) Speaker Recognition Challenge. During the evaluation track of the SITW challenge, we explored conventional universal background model (UBM) Gaussian mixture model (GMM) i-vector systems and recently developed DNN-posteriors based i-vector systems. The systems were investigated under the real-world media channel conditions represented in the challenge. This paper discusses practical issues in training robust i-vector systems and investigates a denoising autoencoder (DAE) based back-end applied to "in the wild" conditions. Our speaker diarization approach for "multi-speaker in the file" conditions is also briefly presented. Experiments on the evaluation dataset demonstrate that DNN-based i-vector systems are superior to the UBM-GMM based systems and that applying a DAE-based back-end helps to improve system performance.