Usage of DNN in Speaker Recognition: Advantages and Problems

Kudashev, Oleg; Novoselov, Sergey; Pekhovsky, Timur; Simonchik, Konstantin; Lavrentyeva, Galina

doi:10.1007/978-3-319-40663-3_10

Cited by 11 publications

(8 citation statements)

References 10 publications


“…Most of the text-independent speaker recognition systems are based on the i-vector extraction framework. Typically, i-vector computation process can be decomposed into three stages: collection of sufficient statistics, extraction of i-vectors and a probabilistic linear discriminant analysis (PLDA) backend [2,1,4]. Sufficient statistics are collected by using a sequence of feature vectors, e.g.…”

Section: Baseline I-vectors (mentioning)

confidence: 99%
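As a rough illustration of the first stage named in the statement above, the sketch below accumulates zeroth- and first-order sufficient statistics from a sequence of feature vectors. It assumes the per-frame component posteriors are already available (for example from a universal background model); the function name `collect_sufficient_stats` and the toy numbers are hypothetical and are not taken from the cited papers.

```python
def collect_sufficient_stats(frames, posteriors):
    """Accumulate zeroth- and first-order statistics.

    frames:     list of T feature vectors, each of length D
    posteriors: list of T lists of length C (per-frame component
                responsibilities, assumed to sum to 1 per frame)
    Returns (N, F) with N[c] = sum_t gamma_t(c)
                   and  F[c] = sum_t gamma_t(c) * x_t
    """
    num_components = len(posteriors[0])
    dim = len(frames[0])
    zeroth = [0.0] * num_components
    first = [[0.0] * dim for _ in range(num_components)]
    for x, gamma in zip(frames, posteriors):
        for c, g in enumerate(gamma):
            zeroth[c] += g
            for d in range(dim):
                first[c][d] += g * x[d]
    return zeroth, first


# Toy input (assumed values): 3 frames, 2-dimensional features, 2 components.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
N, F = collect_sufficient_stats(frames, posteriors)
```

In a full i-vector system these statistics would then feed the i-vector extractor and the PLDA backend, which are not sketched here.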

“…The i-vector framework has inspired deep learning system design in this field. Particularly, in studies [2,4] they use an ASR deep neural network (ASR DNN) to divide acoustic space into senone classes, and the classic total variability (TV) model is applied to discriminate between speakers in that space [1]. In such phonetic discriminative DNN-based systems two major techniques can be distinguished.…”

Section: Introduction (mentioning)

confidence: 99%


On deep speaker embeddings for text-independent speaker recognition

Novoselov, Shulipa, Kremnev et al. 2018

The Speaker and Language Recognition Workshop (Odyssey 2018)

Self Cite


We investigate deep neural network performance in the text-independent speaker recognition task. We demonstrate that using angular softmax activation at the last layer of a classification neural network, instead of a simple softmax activation, makes it possible to train a more generalized discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements to previous DNN-based extractor systems to increase speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding backend. The results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the robustness of the proposed systems when dealing with close to real-life conditions.
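The cosine scoring mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `cosine_score` and `verify`, the example embeddings, and the decision threshold of 0.5 are all assumptions chosen for the example.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enroll_embedding, test_embedding, threshold=0.5):
    """Accept the trial if the score reaches the (hypothetical) threshold."""
    return cosine_score(enroll_embedding, test_embedding) >= threshold


# Toy embeddings (assumed values, far smaller than real speaker embeddings).
same_speaker = verify([1.0, 0.0], [2.0, 0.0])
different_speaker = verify([1.0, 0.0], [0.0, 3.0])
```

In practice the threshold would be tuned on a development set for the target operating point.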



“…Nonetheless, this problem is gradually gaining attention from the deep learning perspective. Particularly, studies [2,4] make use of the ASR deep neural network (ASR DNN) in order to divide acoustic space into senone classes, and the classic total variability (TV) model is applied to discriminate between speakers in that space afterwards [1].…”

Section: Introduction (mentioning)

confidence: 99%

Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition

et al. 2018

Self Cite


Deep neural network based speaker embeddings are becoming increasingly popular in the text-independent speaker recognition task. In contrast to a generatively trained i-vector extractor, a DNN speaker embedding extractor is usually trained discriminatively in the closed-set classification scenario using softmax. The problem we address in the paper is choosing a DNN-based speaker embedding backend solution for speaker verification scoring. There are several options for performing speaker verification in the DNN embedding space. One of them is using a simple heuristic speaker similarity metric for scoring (e.g. the cosine metric). As with i-vector based systems, the standard Linear Discriminant Analysis (LDA) followed by Probabilistic Linear Discriminant Analysis (PLDA) can be used for segregating speaker information. As an alternative, the discriminative metric learning approach can be considered. This work demonstrates that the performance of systems based on deep speaker embeddings can be improved by using Cosine Similarity Metric Learning (CSML) with the triplet loss training scheme. Results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the superiority and robustness of CSML-based systems.
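To make the idea of a triplet loss over cosine similarity more concrete, the sketch below evaluates a hinge-style triplet loss on linearly projected embeddings. It is an illustration under stated assumptions, not the formulation from the paper: the projection matrix, the margin of 0.2, and the names `project`, `cosine`, and `triplet_cosine_loss` are hypothetical, and no training loop is shown.

```python
import math


def project(matrix, vec):
    """Apply a linear projection (list of rows) to an embedding."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def triplet_cosine_loss(matrix, anchor, positive, negative, margin=0.2):
    """Hinge loss that pushes cos(anchor, positive) above
    cos(anchor, negative) by at least `margin` after projection."""
    a = project(matrix, anchor)
    p = project(matrix, positive)
    n = project(matrix, negative)
    return max(0.0, margin - cosine(a, p) + cosine(a, n))


# Toy triplet with an identity projection (all values are assumptions).
identity = [[1.0, 0.0], [0.0, 1.0]]
easy = triplet_cosine_loss(identity, [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard = triplet_cosine_loss(identity, [1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

A metric learning backend would minimize this loss over many triplets with respect to the projection matrix; here the matrix is fixed only to show how the loss behaves on a satisfied triplet and a violated one.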


Section: Introduction (mentioning)

confidence: 99%

Deep CNN Based Feature Extractor for Text-Prompted Speaker Recognition

Novoselov, Kudashev, Shchemelinin et al. 2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Deep learning is still not a very common tool in the speaker verification field. We study deep convolutional neural network performance in the text-prompted speaker verification task. The prompted passphrase is segmented into word states (i.e. digits) to test each digit utterance separately. We train a single high-level feature extractor for all states and use the cosine similarity metric for scoring. The key feature of our network is the Max-Feature-Map activation function, which acts as an embedded feature selector. By using a multitask learning scheme to train the high-level feature extractor, we were able to surpass the classic baseline systems in terms of quality and achieved impressive results for such a novice approach, getting 2.85% EER on the RSR2015 evaluation set. Fusion of the proposed and the baseline systems improves this result.
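The Max-Feature-Map activation named in this abstract is commonly defined as an element-wise maximum over two halves of the channel dimension, which halves the number of channels. The sketch below shows that operation on plain Python lists; the function name `max_feature_map` and the toy activations are assumptions, and the network described in the abstract may apply the operation with details not shown here.

```python
def max_feature_map(channels):
    """Element-wise max over the two halves of the channel list.

    channels: list of 2k channels, each a flat list of activations
              of the same length. Returns k channels.
    """
    if len(channels) % 2 != 0:
        raise ValueError("Max-Feature-Map needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


# Toy input (assumed values): 4 channels with 2 activations each.
activations = [[1.0, -2.0], [0.0, 5.0], [3.0, -4.0], [-1.0, 6.0]]
selected = max_feature_map(activations)
```

Because each output keeps only the larger of two competing activations, the operation behaves like a built-in feature selector, which matches the description in the abstract.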

