The Speaker and Language Recognition Workshop (Odyssey 2020) 2020
DOI: 10.21437/odyssey.2020-65

Robust Speaker Recognition Using Speech Enhancement And Attention Model

Abstract: In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. It aims to improve speaker recognition performance when speech signals are corrupted by noise. Instead of separately processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks. Furthermore, to increase the robustness against noise, a multi-stage attention mechanism is employed to highlight…

Cited by 17 publications
(7 citation statements)
References 16 publications
“…In [14], researchers evaluate and optimize the speech enhancement model based on perceptual loss, which is calculated by a pre-trained speaker embedding network. In [15,16,17], researchers connect and train the speech enhancement and speaker embedding networks in an end-to-end manner. Both perceptual loss and end-to-end training optimize the speech enhancement network with the target of improving speaker verification performance instead of decreasing the regression error between enhanced and clean features.…”
Section: Introduction
Type: mentioning
confidence: 99%
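
The contrast drawn in this statement, between a regression error on features and a loss computed through a speaker embedding network, can be illustrated with a small sketch. This is not the method of the cited papers: the one-layer `embed` function, the random `weights`, the toy feature sizes, and the use of `noisy` as a stand-in for an enhancer's output are all assumptions made only for illustration.

```python
import math
import random


def embed(features, weights):
    """Toy stand-in for a frozen speaker embedding network (assumption):
    one linear layer with tanh per frame, mean-pooled over frames."""
    pooled = [0.0] * len(weights)
    for frame in features:
        for k, row in enumerate(weights):
            pooled[k] += math.tanh(sum(w * x for w, x in zip(row, frame)))
    return [v / len(features) for v in pooled]


def regression_loss(enhanced, clean):
    """Mean squared error between enhanced and clean feature frames."""
    total = sum((e - c) ** 2
                for ef, cf in zip(enhanced, clean)
                for e, c in zip(ef, cf))
    count = sum(len(frame) for frame in clean)
    return total / count


def perceptual_loss(enhanced, clean, weights):
    """Squared distance between the embeddings of enhanced and clean features.
    The embedding weights are held fixed; only the enhancer would be trained."""
    emb_enh = embed(enhanced, weights)
    emb_clean = embed(clean, weights)
    return sum((a - b) ** 2 for a, b in zip(emb_enh, emb_clean))


random.seed(0)
# Hypothetical data: 10 frames of 4-dimensional features.
clean = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]
# Stand-in for the output of an imperfect enhancer (assumption).
noisy = [[x + random.gauss(0, 0.5) for x in frame] for frame in clean]
# Fixed weights of the toy embedding network: 3 output dimensions.
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

print("regression loss:", regression_loss(noisy, clean))
print("perceptual loss:", perceptual_loss(noisy, clean, weights))
```

In this sketch the regression loss penalises every feature deviation equally, while the perceptual loss only penalises deviations that change the embedding, which is the sense in which the enhancer is optimised for the speaker task rather than for feature reconstruction.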
“…Particularly, in a single-microphone setup, a speech enhancement front-end may lead to performance degradation of ASV systems unless carefully designed or optimized [8]-[10]. Therefore, several task-specific optimization (TSO) methods have been proposed to build DNN-based single-channel front-ends [9]-[11], wherein the DNNs are trained for acoustic feature enhancement [9] or spectral masking [10] by a task-specific method, by exploiting a pretrained DSE model. In a similar context, in [11], a denoising DNN was jointly optimized with a pretrained DSE model.…”
Section: Introduction
Type: mentioning
confidence: 99%
“…Therefore, several task-specific optimization (TSO) methods have been proposed to build DNN-based single-channel front-ends [9]-[11], wherein the DNNs are trained for acoustic feature enhancement [9] or spectral masking [10] by a task-specific method, by exploiting a pretrained DSE model. In a similar context, in [11], a denoising DNN was jointly optimized with a pretrained DSE model. Concurrently, the standard acoustic features were substituted by the bottleneck features learned from a DNN, which was adversarially trained to classify noise types [12].…”
Section: Introduction
Type: mentioning
confidence: 99%