Interspeech 2019
DOI: 10.21437/interspeech.2019-1924

Speech Denoising with Deep Feature Losses

Abstract: We present an end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly. Given input audio containing speech corrupted by an additive background signal, the system aims to produce a processed signal that contains only the speech content. Recent approaches have shown promising results using various deep network architectures. In this paper, we propose to train a fully-convolutional context aggregation network using a deep feature loss. That loss is based on comparing …
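As background to the architecture named in the abstract, here is a minimal sketch of a fully-convolutional context aggregation network operating on raw waveforms, assuming PyTorch; the depth, channel width, kernel size, and activation below are illustrative placeholders, not the authors' exact configuration. The key idea is a stack of dilated 1-D convolutions whose receptive field grows exponentially with depth while the output stays sample-aligned with the input.

```python
# Minimal sketch of a fully-convolutional context aggregation network for
# raw-waveform denoising (assumes PyTorch; depth/width are illustrative).
import torch
import torch.nn as nn

class ContextAggregationNet(nn.Module):
    def __init__(self, channels=64, n_layers=8):
        super().__init__()
        layers = []
        in_ch = 1  # mono waveform, shape (batch, 1, samples)
        for i in range(n_layers):
            # Exponentially growing dilation aggregates long-range context
            # without pooling, so the output keeps the input's sample rate.
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=3,
                          dilation=2 ** i, padding=2 ** i),
                nn.LeakyReLU(0.2),
            ]
            in_ch = channels
        # A final 1x1 convolution maps features back to a single waveform.
        layers.append(nn.Conv1d(channels, 1, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return self.net(noisy)

model = ContextAggregationNet()
denoised = model(torch.randn(4, 1, 16000))  # one second at 16 kHz
print(denoised.shape)  # torch.Size([4, 1, 16000])
```

In the paper this network is trained against a deep feature loss rather than a plain waveform loss; a sketch of that loss appears after the "Perceptual Loss" citation statement below.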

Cited by 96 publications (73 citation statements) | References 22 publications
“…An important requirement in DNN-based speech enhancement and separation is generalization, i.e., the model should work for any speaker. To achieve this, in speech enhancement, several studies train a global model M using many speech samples spoken by many speakers [3][4][5][6][7][8][9][10][11][12][13][14]. Unfortunately, in speech separation, generalization cannot be achieved solely with a large-scale training dataset, because there is no way of knowing which signal in the speech mixture is the target.…”
Section: Auxiliary Speaker-Aware Feature for Speech Separation
confidence: 99%
“…Trained using the other 27 speakers' samples, that is, the speaker-independent model (the scope of conventional studies). We used the perceptual evaluation of speech quality (PESQ), CSIG, CBAK, and COVL as the performance metrics, which are the standard metrics for this dataset [4,8,9,12]. The three composite measures CSIG, CBAK, and COVL are popular predictors of the mean opinion score (MOS) of the target signal distortion, background noise interference, and overall speech quality, respectively [30].…”
Section: Open
confidence: 99%
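For orientation, the composite measures named in this statement are conventionally computed as fixed linear combinations of PESQ with the log-likelihood ratio (LLR), weighted spectral slope (WSS), and segmental SNR (segSNR), each clipped to the 1–5 MOS range. The coefficients below are the widely cited ones from the composite-measure literature [30], reproduced here as background rather than taken from this paper:

```latex
\begin{aligned}
\mathrm{CSIG} &= 3.093 - 1.029\,\mathrm{LLR} + 0.603\,\mathrm{PESQ} - 0.009\,\mathrm{WSS} \\
\mathrm{CBAK} &= 1.634 + 0.478\,\mathrm{PESQ} - 0.007\,\mathrm{WSS} + 0.063\,\mathrm{segSNR} \\
\mathrm{COVL} &= 1.594 + 0.805\,\mathrm{PESQ} - 0.512\,\mathrm{LLR} - 0.007\,\mathrm{WSS}
\end{aligned}
```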
“…In style transfer, the perceptual loss ensures that the high-level content of an image remains the same, while allowing the texture of the image to change. For speech-related tasks, a perceptual loss has been used to denoise time-domain speech data [8], where the loss was called a “deep feature loss”. The perceiving network was trained for acoustic environment detection and domestic audio tagging.…”
Section: Perceptual Loss
confidence: 99%
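To make the mechanism in this last statement concrete, here is a minimal sketch of a deep feature loss, assuming PyTorch. The frozen "perceiving" network below is a hypothetical stand-in (the actual loss network in [8] was trained for acoustic environment detection and domestic audio tagging); the loss is the L1 distance between the two signals' internal activations, accumulated over the tapped layers.

```python
# Minimal sketch of a deep feature loss (assumes PyTorch; the feature
# network here is a hypothetical stand-in, not the authors' model).
import torch
import torch.nn as nn

class DeepFeatureLoss(nn.Module):
    """L1 distance between internal activations of a frozen feature network."""
    def __init__(self, feature_net):
        super().__init__()
        self.feature_net = feature_net.eval()
        for p in self.feature_net.parameters():
            p.requires_grad_(False)  # keep the loss network fixed

    def _activations(self, x):
        acts = []
        for layer in self.feature_net:
            x = layer(x)
            if isinstance(layer, nn.Conv1d):
                acts.append(x)  # tap every convolutional layer's output
        return acts

    def forward(self, denoised, clean):
        loss = 0.0
        for a, b in zip(self._activations(denoised),
                        self._activations(clean)):
            loss = loss + torch.mean(torch.abs(a - b))
        return loss

# Hypothetical frozen "perceiving" network, for illustration only.
feature_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=15, stride=2, padding=7), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=15, stride=2, padding=7), nn.ReLU(),
)
criterion = DeepFeatureLoss(feature_net)
loss = criterion(torch.randn(2, 1, 16000), torch.randn(2, 1, 16000))
```

Because the loss network's parameters are frozen, gradients flow only into the denoising model that produced `denoised`, steering it to match the clean signal in feature space rather than sample by sample.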