ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054254
Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement

Abstract: This paper investigates several aspects of training a recurrent neural network (RNN) that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement. Specifically, we focus on an RNN that enhances short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel mean-squared-error-based learning objectives that enable separate control over the importance of speech distortion…
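The abstract's core idea of separately controlling speech distortion and residual noise can be expressed as a convex combination of two MSE terms. Below is a minimal PyTorch sketch of such a weighted loss; the tensor names (`mask`, `clean_mag`, `noise_mag`) and the value of the trade-off weight `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def weighted_sd_loss(mask, clean_mag, noise_mag, alpha=0.35):
    """Sketch of a weighted speech-distortion / noise-suppression loss.

    mask:      predicted spectral suppression gain, shape (batch, frames, bins)
    clean_mag: clean-speech magnitude spectrogram, same shape
    noise_mag: noise magnitude spectrogram, same shape
    alpha:     trade-off weight (illustrative value, not from the paper)
    """
    # Speech distortion: how much the mask attenuates the clean speech.
    l_sd = torch.mean((mask * clean_mag - clean_mag) ** 2)
    # Noise suppression: how much noise leaks through the mask.
    l_ns = torch.mean((mask * noise_mag) ** 2)
    # The convex combination gives separate control over the two error types.
    return alpha * l_sd + (1.0 - alpha) * l_ns
```

Raising `alpha` penalizes speech distortion more heavily at the cost of residual noise, and vice versa, which is the separate control the abstract describes.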

Cited by 83 publications (52 citation statements)
References 31 publications (51 reference statements)
“…We compare the proposed complex-valued PF (post-filter) to a real-valued one. Its design is adapted from the noise suppression network proposed in [34]. This PF uses one dense layer as a feature extraction layer followed by two stacked GRU layers and a dense output layer.…”
Section: Results
confidence: 99%
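For concreteness, here is a minimal sketch of the topology this excerpt describes: one dense feature-extraction layer, two stacked GRU layers, and a dense output layer. The layer sizes and the sigmoid output activation are illustrative assumptions, not values taken from [34].

```python
import torch
import torch.nn as nn

class RealValuedPF(nn.Module):
    """Sketch of the described topology: dense -> 2x stacked GRU -> dense.

    The feature size 257 (one-sided 512-point FFT) and hidden size 256
    are illustrative assumptions, not values from the cited work.
    """
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU())
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, x):  # x: (batch, frames, n_bins) spectral features
        h = self.feat(x)        # dense feature-extraction layer
        h, _ = self.gru(h)      # two stacked GRU layers
        return self.out(h)      # per-bin suppression gains in [0, 1]
```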
“…While this method unsurprisingly achieves only a minor MOS improvement of around 0.12 on a test set that includes many highly nonstationary noise types, the achievement is notable relative to its computational burden: on the order of a few thousand MACs, a tiny fraction of even our smallest models. For a direct comparison of the following DNN-based baselines, we took only the architectures but trained them using the same features, prediction targets, loss, and training procedure as for all other networks in this paper: i) NSnet [4] yields MOS similar to the classic NS. ii) The fully convolutional architecture proposed in [27] underperforms in this task, which we mainly attribute to the absence of long-term temporal modeling, as it uses only 8 frames of temporal context.…”
Section: Results
confidence: 99%
“…Earlier network architectures were mainly recurrent neural network (RNN) structures, which were considered promising in terms of efficiency due to their temporal modeling capabilities [2][3][4]. While such models seem to have hit a performance saturation, the use of convolutional recurrent networks (CRNs) and convolutional neural networks (CNNs) raised performance but resulted in the development of enormously large architectures [5][6][7][8] that are impractical to run on typical edge devices like consumer laptops and mobile phones, let alone less powerful devices like wearables or hearing aids.…”
Section: Introduction
confidence: 99%
“…The STFT is first obtained using a 20-millisecond Hamming window with 50% overlap and a 512-point discrete Fourier transform. Then we take the natural logarithm of the power of the STFT and normalize the resulting log-power spectrum (LPS) with frequency-dependent online normalization following [22].…”
Section: Methods
confidence: 99%
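A minimal sketch of this feature pipeline follows. The window, overlap, and FFT size are as stated in the excerpt; the exponential-smoothing form of the frequency-dependent online normalization (and its constants) is an assumption, since the scheme in [22] is not reproduced here.

```python
import numpy as np

def lps_features(x, sr=16000, alpha=0.99, eps=1e-12):
    """Log-power-spectrum (LPS) features with frequency-dependent online
    normalization. The per-bin running mean/variance update (exponential
    smoothing with `alpha`) is an illustrative assumption.
    """
    win_len = int(0.02 * sr)      # 20 ms Hamming window
    hop = win_len // 2            # 50% overlap
    n_fft = 512
    window = np.hamming(win_len)

    n_bins = n_fft // 2 + 1
    mean = np.zeros(n_bins)
    var = np.ones(n_bins)
    feats = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        spec = np.fft.rfft(frame, n=n_fft)
        lps = np.log(np.abs(spec) ** 2 + eps)  # natural log of STFT power
        # Per-frequency running statistics, updated frame by frame (online).
        mean = alpha * mean + (1 - alpha) * lps
        var = alpha * var + (1 - alpha) * (lps - mean) ** 2
        feats.append((lps - mean) / np.sqrt(var + eps))
    return np.stack(feats)  # shape: (frames, n_bins)
```

Because the statistics are updated causally, frame by frame, this normalization is compatible with the real-time, single-frame-in, single-frame-out setting the paper targets.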