2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
DOI: 10.1109/apsipaasc47483.2019.9023039
Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition

Abstract: Recently, speaker embeddings extracted from a speaker-discriminative deep neural network (DNN) have yielded better performance than conventional methods such as i-vector. In most cases, the DNN speaker classifier is trained using cross-entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin …
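As a rough illustration of the margin losses the abstract refers to, the sketch below implements an additive angular margin (AAM) softmax in NumPy. This is a generic textbook-style formulation, not the paper's implementation; the function names, the margin of 0.2, and the scale of 30 are assumptions chosen for the example. The key idea is that the margin m is added to the angle of the target class only, so the target logit becomes s * cos(theta + m), which forces embeddings to sit closer to their class center than plain softmax requires.

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax logits.

    embeddings: (N, D) speaker embeddings
    weights:    (C, D) class (speaker) weight vectors
    labels:     (N,)   ground-truth speaker indices

    Both embeddings and weights are L2-normalized, so the plain
    logits are cosine similarities; the margin is added to the
    angle of the target class only.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)   # (N, C) cosine similarities
    theta = np.arccos(cos)
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    # Penalize only the target class: cos(theta + m) < cos(theta)
    cos_margin = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_margin

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, computed stably in log space."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()
```

Because the margin shrinks only the target-class logit, the loss for a correctly oriented embedding is larger than under plain softmax, which is precisely what pushes training toward tighter intra-class clusters and wider inter-class gaps.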

Cited by 96 publications (39 citation statements); references 24 publications (32 reference statements).
“…The standard training criterion is therefore cross entropy. More discriminative criteria, called angular softmax loss (A-softmax) and its variants, have recently been proposed and evaluated in [20,24,30]. These criteria consider angular margins between classes and are expected to produce more separable embedding representations.…”
Section: Classifier (mentioning)
confidence: 99%
“…Villalba et al summarized several state-of-the-art speaker recognition systems for the NIST SRE18 Challenge [16], where x-vector based systems [17] consistently outperformed i-vector based systems [18]. There has also been a surge of interest in new encoding methods and end-to-end loss functions for speaker recognition [19,20,21,22,23,24,25]. One prominent advancement is the use of learnable dictionary encoding (LDE) [19] and angular softmax [20] for speaker recognition, which are reported to boost speaker recognition performance on open-source corpora such as the VoxCelebs [26,27].…”
Section: Introduction (mentioning)
confidence: 99%
“…As shown in Table 7, Vox1, Vox1-E and Vox1-H denote the VoxCeleb1, VoxCeleb1-E and VoxCeleb1-H test datasets, respectively. We used AAM-Softmax as the loss function [22]. Experimental results showed that our method had an improvement of 0.56%, 0.88% and 1.69% on the VoxCeleb1, VoxCeleb1-E and VoxCeleb1-H test datasets, respectively.…”
Section: Comparison and Analysis (mentioning)
confidence: 99%
“…The database that supports the conclusions of this article is available in the [VoxCeleb [21,22] database] repository [Unique persistent identifier and hyperlink to the dataset at https://www.robots.ox.ac.uk/~vgg/data/voxceleb/ .]…”
Section: Availability of Data and Materials (mentioning)
confidence: 99%
“…Considering this, we could use only a portion of the whole utterance to get the embedding from the acoustic encoder. This is a common practice in training speaker discriminative networks with speaker labels [9,10,11]. Thus, we explored multiple ways of training a SPN along this direction:…”
Section: Sampling Segments for SPN Inputs (mentioning)
confidence: 99%
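The segment-sampling practice this last citation describes (training on a fixed-length crop of each utterance rather than the whole thing) can be sketched as follows. This is a generic illustration, not the cited authors' code; the function name, the wrap-around padding for short utterances, and the (frames, features) layout are all assumptions made for the example.

```python
import numpy as np

def sample_segment(features, seg_len, rng=None):
    """Randomly crop a fixed-length segment from an utterance.

    features: (T, F) frame-level feature matrix (e.g. filterbanks)
    seg_len:  number of frames to keep

    If the utterance is shorter than seg_len, the features are
    tiled (wrapped around) to reach the target length, so every
    training example has the same shape.
    """
    rng = rng or np.random.default_rng()
    num_frames = features.shape[0]
    if num_frames >= seg_len:
        start = rng.integers(0, num_frames - seg_len + 1)
        return features[start:start + seg_len]
    reps = int(np.ceil(seg_len / num_frames))
    return np.tile(features, (reps, 1))[:seg_len]
```

In a training loop one would call this once per utterance per epoch, so the network sees a different random crop each time, which acts as cheap data augmentation while keeping batch shapes uniform.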