2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
DOI: 10.1109/apsipaasc47483.2019.9023039
Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition

Abstract: Recently, speaker embeddings extracted from a speaker-discriminative deep neural network (DNN) have yielded better performance than conventional methods such as i-vector. In most cases, the DNN speaker classifier is trained using cross-entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin …
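As a rough illustration of the margin losses the abstract refers to, the sketch below implements an additive angular margin (AAM) softmax in NumPy. This is a generic textbook-style formulation, not the paper's implementation; the function names, the margin of 0.2, and the scale of 30 are assumptions chosen for the example. The key idea is that the margin m is added to the angle of the target class only, so the target logit becomes s * cos(theta + m), which forces embeddings to sit closer to their class center than plain softmax requires.

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax logits.

    embeddings: (N, D) speaker embeddings
    weights:    (C, D) class (speaker) weight vectors
    labels:     (N,)   ground-truth speaker indices

    Both embeddings and weights are L2-normalized, so the plain
    logits are cosine similarities; the margin is added to the
    angle of the target class only.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)   # (N, C) cosine similarities
    theta = np.arccos(cos)
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    # Penalize only the target class: cos(theta + m) < cos(theta)
    cos_margin = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_margin

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, computed stably in log space."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()
```

Because the margin shrinks only the target-class logit, the loss for a correctly oriented embedding is larger than under plain softmax, which is precisely what pushes training toward tighter intra-class clusters and wider inter-class gaps.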

Cited by 96 publications (39 citation statements); references 24 publications (32 reference statements).
“…The standard training criterion is therefore cross entropy. More discriminative criteria, called angular softmax loss (A-softmax) and its variants, have recently been proposed and evaluated in [20,24,30]. These criteria consider angular margins between classes and are expected to produce more separable embedding representations.…”
Section: Classifier (mentioning)
confidence: 99%
“…Villalba et al summarized several state-of-the-art speaker recognition systems for the NIST SRE18 Challenge [16], where x-vector based systems [17] consistently outperformed i-vector based systems [18]. There has also been a surge of interest in new encoding methods and end-to-end loss functions for speaker recognition [19,20,21,22,23,24,25]. One prominent advancement is the use of learnable dictionary encoding (LDE) [19] and angular softmax [20] for speaker recognition, which are reported to boost speaker recognition performance on open-source corpora such as the VoxCelebs [26,27].…”
Section: Introduction (mentioning)
confidence: 99%
“…As shown in Table 7, Vox1, Vox1-E and Vox1-H denote the VoxCeleb1, VoxCeleb1-E and VoxCeleb1-H test datasets, respectively. We used AAM-Softmax as the loss function [22]. Experimental results showed that our method had an improvement of 0.56%, 0.88% and 1.69% on the VoxCeleb1, VoxCeleb1-E and VoxCeleb1-H test datasets, respectively.…”
Section: Comparison and Analysis (mentioning)
confidence: 99%
“…The database that supports the conclusions of this article is available in the [VoxCeleb [21,22] database] repository [Unique persistent identifier and hyperlink to the dataset at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/ .]…”
Section: Availability of Data and Materials (mentioning)
confidence: 99%
“…Considering this, we could use only a portion of the whole utterance to get the embedding from the acoustic encoder. This is a common practice in training speaker discriminative networks with speaker labels [9,10,11]. Thus, we explored multiple ways of training a SPN along this direction:…”
Section: Sampling Segments for SPN Inputs (mentioning)
confidence: 99%
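The segment-sampling practice this last citation describes (training on a fixed-length crop of each utterance rather than the whole thing) can be sketched as follows. This is a generic illustration, not the cited authors' code; the function name, the wrap-around padding for short utterances, and the (frames, features) layout are all assumptions made for the example.

```python
import numpy as np

def sample_segment(features, seg_len, rng=None):
    """Randomly crop a fixed-length segment from an utterance.

    features: (T, F) frame-level feature matrix (e.g. filterbanks)
    seg_len:  number of frames to keep

    If the utterance is shorter than seg_len, the features are
    tiled (wrapped around) to reach the target length, so every
    training example has the same shape.
    """
    rng = rng or np.random.default_rng()
    num_frames = features.shape[0]
    if num_frames >= seg_len:
        start = rng.integers(0, num_frames - seg_len + 1)
        return features[start:start + seg_len]
    reps = int(np.ceil(seg_len / num_frames))
    return np.tile(features, (reps, 1))[:seg_len]
```

In a training loop one would call this once per utterance per epoch, so the network sees a different random crop each time, which acts as cheap data augmentation while keeping batch shapes uniform.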