Interspeech 2018
DOI: 10.21437/interspeech.2018-1209

Triplet Loss Based Cosine Similarity Metric Learning for Text-independent Speaker Recognition

Abstract: Deep neural network (DNN) based speaker embeddings have become increasingly popular in the text-independent speaker recognition task. In contrast to a generatively trained i-vector extractor, a DNN speaker embedding extractor is usually trained discriminatively in a closed-set classification scenario using softmax. The problem addressed in this paper is choosing a backend solution for speaker verification scoring on top of DNN-based speaker embeddings. There are several options to perform speaker verification in the DNN emb…
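The approach named in the title, a triplet loss computed on cosine similarity between speaker embeddings, can be sketched as follows. This is a minimal illustration assuming PyTorch; the margin, batch size, and embedding dimension are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive share a speaker; negative comes from a different one
    sim_ap = F.cosine_similarity(anchor, positive, dim=1)
    sim_an = F.cosine_similarity(anchor, negative, dim=1)
    # hinge: same-speaker similarity should exceed different-speaker
    # similarity by at least `margin`
    return torch.clamp(sim_an - sim_ap + margin, min=0.0).mean()

# random tensors stand in for a DNN extractor's embeddings (illustrative)
a = torch.randn(32, 256, requires_grad=True)
p = torch.randn(32, 256, requires_grad=True)
n = torch.randn(32, 256, requires_grad=True)
cosine_triplet_loss(a, p, n).backward()
```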

Cited by 50 publications (32 citation statements) · References 28 publications (46 reference statements)

“…Results demonstrated that the proposed pAUC deep embedding is highly competitive with state-of-the-art identification-loss-based deep embedding methods using Softmax and ArcSoftmax output units. Note that a very recent work [20], proposed at the same time as our work in [23], maximizes the area under the ROC curve (AUC) for text-dependent speaker verification. It can be shown that AUC is a particular case of pAUC, and experimental results show that the pAUC deep embedding significantly outperforms the AUC deep embedding.…”
Section: Introduction
confidence: 82%
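The AUC/pAUC relationship in the quoted statement can be checked numerically. A small sketch assuming scikit-learn, whose roc_auc_score accepts a max_fpr argument (it returns the McClish-standardized pAUC, which reduces to the ordinary AUC when max_fpr=1.0); the trial scores here are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# synthetic verification trials: 1 = target, 0 = non-target
labels = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.normal(1.0, 1.0, 500),
                         rng.normal(0.0, 1.0, 500)])

auc = roc_auc_score(labels, scores)                # full ROC area
pauc = roc_auc_score(labels, scores, max_fpr=0.1)  # area over FPR <= 0.1
print(f"AUC = {auc:.3f}, standardized pAUC(0.1) = {pauc:.3f}")
# with max_fpr=1.0 the call returns exactly `auc`: AUC is the special
# case of pAUC taken over the full false-positive range
```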
“…where s_k is the score for trial k given by Equation (2), σ is the sigmoid function, and α and β are the calibration parameters, trained to minimize the quantity in Equation (3). To summarize, Equations (1), (2) and (5) show the pipeline that is applied to the embeddings in the standard PLDA-based backend. The parameters involved in these equations are all trained separately, freezing the parameters of the previous steps in order to obtain input data to train the next step.…”
Section: Standard PLDA-based Backend
confidence: 99%
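The calibration step described in the quoted statement, an affine map of the raw score passed through a sigmoid with α and β trained by cross-entropy, can be sketched as below. This is a reconstruction from the description assuming PyTorch, not the authors' code; the optimizer and hyperparameters are illustrative.

```python
import torch

def fit_calibration(scores, labels, epochs=200, lr=0.1):
    # p(target | trial k) = sigmoid(alpha * s_k + beta); alpha and beta
    # are trained to minimize binary cross-entropy over labeled trials
    alpha = torch.ones(1, requires_grad=True)
    beta = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([alpha, beta], lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = bce(alpha * scores + beta, labels)
        loss.backward()
        opt.step()
    return alpha.item(), beta.item()

# usage: raw trial scores and 0/1 target labels as float tensors
# alpha, beta = fit_calibration(raw_scores, trial_labels)
```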
“…We propose a backend with the same functional form as the PLDA backend explained in the previous section, but where all parameters are optimized jointly, in a manner similar to the one used in [6] (though note that in this paper we only optimize jointly up to the backend stage, rather than the full pipeline as in Rohdin's paper). We first initialize all parameters in Equations (1), (2) and (5) as in the standard PLDA-based backend. Then, we fine-tune the parameters to optimize the cross-entropy in Equation (3) using some variant of stochastic gradient descent.…”
Section: Proposed Discriminative Backend
confidence: 99%
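A hedged sketch of the joint optimization the statement describes: a module with the same functional form (projection, bilinear PLDA-like scoring, affine calibration) whose parameters are all fine-tuned together by SGD after PLDA-based initialization. The module structure, names, and dimensions are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DiscriminativeBackend(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)  # LDA-like projection
        self.W = nn.Parameter(torch.eye(dim))        # PLDA-like bilinear term
        self.alpha = nn.Parameter(torch.ones(1))     # calibration scale
        self.beta = nn.Parameter(torch.zeros(1))     # calibration offset

    def forward(self, e1, e2):
        x1, x2 = self.proj(e1), self.proj(e2)
        score = (x1 @ self.W * x2).sum(dim=1)        # raw trial score
        return self.alpha * score + self.beta        # calibrated logit

backend = DiscriminativeBackend(dim=128)  # initialize from standard PLDA here
opt = torch.optim.SGD(backend.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()
# training loop over labeled trials (e1, e2, y):
#     loss = bce(backend(e1, e2), y); loss.backward(); opt.step()
```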
“…It has many applications, including noise cancellation, audio editing, and preprocessing for speech recognition. Denoting the noisy speech as y(t), we have y(t) = x(t) + n(t), (1) where x(t) and n(t) are respectively the clean speech and the noise, with t being the time index. Speech enhancement tries to recover the clean speech x from the noisy speech y.…”
Section: Introduction
confidence: 99%
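The additive model y(t) = x(t) + n(t) from the quoted introduction can be illustrated by mixing a clean signal with noise at a chosen SNR. A minimal NumPy sketch with synthetic signals; mix_at_snr and the 16 kHz rate are our illustrative choices, not part of the cited work.

```python
import numpy as np

def mix_at_snr(x, n, snr_db):
    # scale the noise so that 10*log10(P_x / P_n) equals snr_db,
    # then form the noisy observation y(t) = x(t) + n(t)
    gain = np.sqrt(np.sum(x ** 2) / (np.sum(n ** 2) * 10 ** (snr_db / 10)))
    return x + gain * n

t = np.arange(16000) / 16000.0             # one second at 16 kHz
x = np.sin(2 * np.pi * 220.0 * t)          # stand-in for clean speech x(t)
n = np.random.default_rng(0).standard_normal(t.size)  # stand-in for noise n(t)
y = mix_at_snr(x, n, snr_db=5.0)           # noisy speech y(t)
```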