ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053110
Feature Enhancement with Deep Feature Losses for Speaker Verification

Abstract: Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage on the recent advancements made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker embedding network. We experimentally verify the approach on simulated and real data. A simulated testing setup is crea…
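The core idea in the abstract — training an enhancement network against distances measured in the hidden activation space of a frozen, pre-trained speaker embedding network — can be sketched in a few lines. The following is a minimal numpy illustration, not the paper's implementation: the two-layer "auxiliary network" with random weights stands in for a real pre-trained embedding model, and the L1 distance between layers is one common choice of deep feature loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen auxiliary "speaker embedding" network: two linear layers with
# ReLU. The random weights are hypothetical placeholders for a model
# that would in practice be pre-trained and kept fixed.
W1 = rng.standard_normal((40, 64))
W2 = rng.standard_normal((64, 32))

def hidden_activations(feats):
    """Hidden activations of the frozen auxiliary network for a
    (frames x 40) feature matrix."""
    h1 = np.maximum(feats @ W1, 0.0)
    h2 = np.maximum(h1 @ W2, 0.0)
    return [h1, h2]

def deep_feature_loss(enhanced, clean):
    """Sum over layers of the mean absolute difference between the
    auxiliary network's activations for enhanced and clean features."""
    return sum(
        np.mean(np.abs(a - b))
        for a, b in zip(hidden_activations(enhanced),
                        hidden_activations(clean))
    )

clean = rng.standard_normal((100, 40))            # 100 frames, 40-dim features
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

print(deep_feature_loss(clean, clean))  # 0.0 for identical inputs
print(deep_feature_loss(noisy, clean) > 0.0)
```

In the paper's setting this scalar would be backpropagated through the enhancement network only, with the auxiliary network's weights held fixed.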

Cited by 22 publications (25 citation statements)
References 23 publications
“…They are used as the high level abstraction to measure the training loss between reconstructed signals and reference signals. Such training loss is also called deep feature loss [71], [72].…”
Section: Perceptual Loss for Style Reconstruction
confidence: 99%
“…Nevertheless, these methods are not task-specific, and the training data of speech enhancement and speaker embedding might belong to different domains. In [14], researchers evaluate and optimize the speech enhancement model based on perceptual loss, which is calculated by a pre-trained speaker embedding network. In [15,16,17], researchers connect and train the speech enhancement and speaker embedding networks in an end-to-end manner.…”
Section: Introduction
confidence: 99%
“…As the input of the ResNet network, three-channel (RGB) images were used, so the spectrograms were converted to RGB images. Another study in which MFCCs were used as the input of a DNN was developed to construct x-vector embeddings for speaker verification tasks [ 40 ]. In addition, in several works, the TDNNs were exploited to recognize speech [ 41 ], emotion [ 42 ], speaker, or voice activity.…”
Section: Introduction
confidence: 99%