ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053407

Within-Sample Variability-Invariant Loss for Robust Speaker Recognition Under Noisy Environments

Abstract: Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the "clean" embedding of the noisy utterance. Specifically, the network is trained with the original speaker identification loss with an auxiliary within-sample variability-invariant loss. This auxiliary variability-invariant loss is used to learn the same embedding among the clean utterance and …
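
The training objective described in the abstract can be illustrated with a minimal sketch. The abstract is truncated here, so everything specific below is an assumption rather than the authors' formulation: the choice of MSE or cosine distance for the invariance term, the `weight` parameter, the decision to apply the identification loss to the noisy utterance, and all function names are hypothetical.

```python
import math

def softmax_cross_entropy(logits, label):
    """Speaker identification loss for one utterance: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mse_distance(a, b):
    """Mean squared error between two embeddings of equal length (assumed distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity between two embeddings (alternative assumed distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def total_loss(noisy_logits, label, clean_emb, noisy_emb,
               weight=1.0, distance=mse_distance):
    """Identification loss plus a weighted clean-vs-noisy invariance term.

    Assumption: the identification loss is computed on the noisy utterance,
    and `weight` balances the two terms. Neither detail comes from the source.
    """
    id_loss = softmax_cross_entropy(noisy_logits, label)
    invariant_loss = distance(clean_emb, noisy_emb)
    return id_loss + weight * invariant_loss

# Toy usage: embeddings of the same utterance before and after adding noise.
clean_emb = [0.2, -0.1, 0.7]
noisy_emb = [0.25, -0.05, 0.6]
noisy_logits = [2.0, 0.5, -1.0]  # scores over three hypothetical speakers
print(total_loss(noisy_logits, 0, clean_emb, noisy_emb, weight=0.5))
```

The invariance term is zero when the noisy embedding matches the clean one, so minimizing the sum pushes the network toward noise-independent embeddings while the identification term keeps them discriminative.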

Cited by 38 publications (21 citation statements)
References 28 publications
“…Data augmentation is proven to be an effective strategy for both conventional learning with supervision [24] and contrastive self-supervision learning [7,8,15] in the context of deep learning. We perform data augmentation with MUSAN dataset [25].…”
Section: Data Augmentation (mentioning)
confidence: 99%
“…We use the same network architecture as in [24]. ReLU nonlinear activation and batch normalization are applied to each convolutional layer in ResNet.…”
Section: Contrastive Self-supervised Learning Setup (mentioning)
confidence: 99%
“…Channel-invariant training. Inspired by [19], we propose channel-invariant loss which is used to force the embedding of augmented segments as similar as its clean version, preventing the network from encoding the undesired channel information into the speaker representation. The model can learn to filter out channel factors.…”
Section: Remove Channel Information (mentioning)
confidence: 99%
“…For input-level, models usually can be adapted by training with enhanced [8] or domain-translated [9] input features. For adaptation at embedding-level, it often targets at minimizing certain distances between source and target domains to align them in the same embedding space, such as cosine distance [10], mean squared error (MSE) [11], and maximum mean discrepancy (MMD) [12]. However, this method usually requires parallel or artificial simulated data, which cannot generalize well to real-world scenarios.…”
Section: Introduction (mentioning)
confidence: 99%