Analysis of trade-offs between magnitude and phase estimation in loss functions for speech denoising and dereverberation

Luo, Xiaoxue; Zheng, Chengshi; Li, Andong; Ke, Yuxuan; Li, Xiaodong

doi:10.1016/j.specom.2022.10.003

Cited by 9 publications

(5 citation statements)

References 34 publications


“…The second benefit is that speech distortion and noise reduction can be better balanced when compared with the raw magnitude without compression, resulting in improving speech quality. This may be the case because compression reduces the dynamic range of the magnitude values, facilitating the training process (Luo et al, 2022). Compression of the magnitude of the noisy spectrum can be expressed by: where $\alpha_{cp} \in (0, 1]$ is the compression factor.…”

Section: Deep Learning Methods (mentioning)

confidence: 99%

“…When the complex spectrum-based MSE loss function is used, the phase estimation error is reduced but spectral magnitude distortion increases. The trade-off between spectral magnitude distortion and phase recovery has been called the “compensation effect” (Wang et al, 2021; Luo et al, 2022). To reduce both magnitude and phase distortion, a combined loss function has been proposed, which is formulated as: where $\alpha_{com}$ is a linear combination coefficient.…”

Section: Deep Learning Methods (mentioning)

confidence: 99%

“…Various easily extracted features of noisy speech have been used in deep neural network (DNN) models designed to extract clean speech from noisy speech, including the LOG-AMP, the log-power spectrum (Xu et al, 2014b, 2015), spectral amplitudes (Tan & Wang, 2018) and the spectral amplitudes raised to a power less than 1 (Zhao et al, 2020), which represents a form of amplitude compression. The cube-root of the spectral amplitudes generally led to the best performance, perhaps because taking the cube-root reduces the dynamic range of the speech, facilitating the training process (Luo et al, 2022). Tan & Wang (2020) extracted the real and imaginary parts of the complex spectrum of noisy speech as input features.…”

Section: Feature Extraction (mentioning)

confidence: 99%
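The amplitude compression described in the statement above (spectral amplitudes raised to a power less than 1, with the cube root as one choice) can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions: the function names `compress_magnitude` and `dynamic_range_db` and the synthetic test spectrum are hypothetical and do not come from the cited papers.

```python
import numpy as np


def compress_magnitude(mag, alpha=1.0 / 3.0):
    """Power-law compression: raise non-negative magnitudes to alpha in (0, 1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return np.asarray(mag, dtype=float) ** alpha


def dynamic_range_db(mag):
    """Ratio of the largest to the smallest magnitude, in dB."""
    mag = np.asarray(mag, dtype=float)
    return 20.0 * np.log10(mag.max() / mag.min())


# Hypothetical magnitudes spanning three orders of magnitude (60 dB).
mag = np.logspace(-3, 0, 16)
mag_c = compress_magnitude(mag)  # cube-root compression

range_raw = dynamic_range_db(mag)
range_compressed = dynamic_range_db(mag_c)
print(range_raw, range_compressed)
```

Because the compression is a power law, the dynamic range in dB scales by the exponent, so the cube root shrinks it to one third of the original value. That is the sense in which compression "reduces the dynamic range" in the quoted statements.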


Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods

Zheng, Zhang, Liu et al. 2023

Trends in Hearing

Self Cite


Frequency-domain monaural speech enhancement has been extensively studied for over 60 years, and a great number of methods have been proposed and applied to many devices. In the last decade, monaural speech enhancement has made tremendous progress with the advent and development of deep learning, and performance using such methods has been greatly improved relative to traditional methods. This survey paper first provides a comprehensive overview of traditional and deep-learning methods for monaural speech enhancement in the frequency domain. The fundamental assumptions of each approach are then summarized and analyzed to clarify their limitations and advantages. A comprehensive evaluation of some typical methods was conducted using the WSJ + Deep Noise Suppression (DNS) challenge and Voice Bank + DEMAND datasets to give an intuitive and unified comparison. The benefits of monaural speech enhancement methods using objective metrics relevant for normal-hearing and hearing-impaired listeners were evaluated. The objective test results showed that compression of the input features was important for simulated normal-hearing listeners but not for simulated hearing-impaired listeners. Potential future research and development topics in monaural speech enhancement are suggested.


“…Loss function: To partially avoid the compensation effect between the magnitude and RI constraints in [28] and [29], we use the linear combination of magnitude and complex loss:…”

Section: Magnitude Decoder (mentioning)

confidence: 99%

“…To partially avoid the compensation effect between the magnitude and RI constraints in [28] and [29], we use the linear combination of magnitude and complex loss:

\begin{equation} \mathcal{L} = \frac{\mathcal{L}_{\text{mag}} + \mathcal{L}_{ri}}{2} \end{equation}

\begin{equation} \mathcal{L}_{\text{mag}} = \mathbb{E}_{S_{\text{mag}}, \hat{S}_{\text{mag}}} \left[ \left\Vert S_{\text{mag}} - \hat{S}_{\text{mag}} \right\Vert^2 \right] \end{equation}

\begin{equation} \mathcal{L}_{ri} = \mathbb{E}_{S_r, \hat{S}_r} \left[ \left\Vert S_r - \hat{S}_r \right\Vert^2 \right] + \mathbb{E}_{S_i, \hat{S}_i} \left[ \left\Vert S_i - \hat{S}_i \right\Vert^2 \right] \end{equation}

…”

Section: Loss Function (mentioning)

confidence: 99%
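The combined loss quoted above averages a magnitude term and a real/imaginary (RI) term. A minimal NumPy sketch follows, assuming the expectations are approximated by a mean over all time-frequency bins (the normalisation is an assumption, as the quoted statement does not specify it). The function names and the synthetic spectra are hypothetical.

```python
import numpy as np


def magnitude_loss(s, s_hat):
    """Mean squared error between the magnitudes of two complex spectra."""
    return np.mean((np.abs(s) - np.abs(s_hat)) ** 2)


def ri_loss(s, s_hat):
    """Sum of the mean squared errors of the real and imaginary parts."""
    real_term = np.mean((s.real - s_hat.real) ** 2)
    imag_term = np.mean((s.imag - s_hat.imag) ** 2)
    return real_term + imag_term


def combined_loss(s, s_hat):
    """Equal-weight combination: (L_mag + L_ri) / 2."""
    return 0.5 * (magnitude_loss(s, s_hat) + ri_loss(s, s_hat))


# Hypothetical clean spectrum and an estimate that is wrong only in phase.
rng = np.random.default_rng(0)
s = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
s_hat = s * np.exp(1j * 0.5)  # same magnitudes, phase rotated by 0.5 rad

l_mag = magnitude_loss(s, s_hat)
l_ri = ri_loss(s, s_hat)
l_total = combined_loss(s, s_hat)
print(l_mag, l_ri, l_total)
```

In this example the magnitude term is numerically zero while the RI term is not, which shows why a magnitude-only loss cannot penalise phase errors and why the two terms are combined.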

Acoustic echo cancellation based on two‐stage BLSTM

Niu, Ou, Song et al. 2024

Electronics Letters


Acoustic echo cancellation (AEC) methods aim to suppress the acoustic coupling for hands‐free speech communication. Traditional AEC works by identifying the acoustic impulse response using adaptive algorithms. With recent research advances, deep learning has become an attractive choice for AEC. This paper introduces a two‐stage bidirectional long short term memory (TS‐BLSTM) framework, incorporating multi‐head self‐attention mechanisms after each BLSTM block. This is aimed at better capturing contextual information and further enhancing ability of the model to handle complex acoustic scenarios. The BLSTM blocks are utilized to aggregate magnitude spectrum information, modelling both time and frequency dependencies. Additionally, dilation convolution is introduced to broaden the range of information in each convolution output. The magnitude decoder estimates a mask for the input, resulting in the generation of an estimated magnitude spectrum for near‐end speech. Experimental results indicate that the proposed method achieves promising outcomes.


MAMGAN: Multiscale attention metric GAN for monaural speech enhancement in the time domain

et al. 2023
