An Instrumental Quality Measure for Artificially Bandwidth-Extended Speech Signals

2022

Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). A DNS trained with mean squared error (MSE) losses cannot guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable and therefore cannot be used directly as an optimization criterion for gradient-based learning. In this work, we propose an end-to-end non-intrusive PESQNet DNN to estimate the PESQ scores of the enhanced speech signal. By providing a reference-free perceptual loss, it serves as a mediator for the DNS training, allowing the PESQ score of the enhanced speech signal to be maximized. We illustrate the potential of our proposed PESQNet-mediated training on a strong baseline DNS. As a further novelty, we propose to train the DNS and the PESQNet alternatingly to keep the PESQNet up-to-date and performing well specifically for the DNS under training. Detailed analysis shows that the PESQNet mediation further increases the DNS performance by about 0.1 PESQ points on synthetic test data and by 0.03 DNSMOS points on real test data, compared to training with the MSE-based loss. Our proposed method outperforms the Interspeech 2021 DNS Challenge baseline by 0.2 PESQ points on synthetic test data and 0.1 DNSMOS points on real test data. Furthermore, it improves on the same DNS trained with an approximated differentiable PESQ loss by about 0.4 PESQ points on synthetic test data and 0.2 DNSMOS points on real test data.

“…Following [9], [11] and [19], the performance of the PESQ-DNN and baseline models is measured by the mean absolute error (MAE)…”

Section: Performance Metrics (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%
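The citation statement above evaluates a PESQ-estimating model by its mean absolute error (MAE). As a minimal sketch of that metric, assuming the predictions and the reference PESQ scores are available as equal-length lists of floats (the function name and the example values are illustrative and not taken from the cited paper):

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between predicted and reference scores."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one score pair is required")
    # Sum the per-utterance absolute errors, then normalize by the count.
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical PESQ estimates versus hypothetical reference PESQ scores.
estimated_scores = [2.0, 3.0, 4.0]
reference_scores = [2.5, 3.0, 3.0]
mae = mean_absolute_error(estimated_scores, reference_scores)
```

A lower MAE means the estimated scores track the reference scores more closely on average; the metric is expressed in the same units as the scores themselves.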

Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet

Strake

2022

“…To instrumentally evaluate the enhanced speech ŝ(n), the mean logarithmic spectral distance (LSD) averaged over frames is employed [61]. The LSD is calculated as…”

Section: E Metrics Of Speech Quality (mentioning)

confidence: 99%
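The quoted statement above breaks off before the LSD formula, so the exact definition used in the citing paper is not reproduced here. The sketch below shows one commonly used form of the frame-averaged logarithmic spectral distance, under stated assumptions: both signals are given as per-frame power spectra (lists of non-negative floats), the per-frame distance is the root mean square of the dB difference across frequency bins, and a small `eps` guards against division by zero. The function name, the `eps` parameter, and the input layout are illustrative choices.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-12):
    """Mean over frames of the RMS log-spectral difference in dB.

    ref_power, est_power: lists of frames, each frame a list of
    power-spectrum values for the same frequency bins.
    """
    if len(ref_power) != len(est_power) or not ref_power:
        raise ValueError("inputs must contain the same non-zero number of frames")
    frame_distances = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        if len(ref_frame) != len(est_frame) or not ref_frame:
            raise ValueError("frames must have the same non-zero number of bins")
        # Squared dB difference per frequency bin.
        squared_db = [
            (10.0 * math.log10((r + eps) / (e + eps))) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # RMS across bins gives the distance for this frame.
        frame_distances.append(math.sqrt(sum(squared_db) / len(squared_db)))
    # Average the per-frame distances.
    return sum(frame_distances) / len(frame_distances)
```

With this form, identical spectra give a distance of 0 dB, and a uniform tenfold power mismatch in every bin gives about 10 dB.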

Convolutional Neural Networks to Enhance Coded Speech

Zhao

Liu

2019

Self Cite

Enhancing coded speech suffering from far-end acoustic background noise, quantization noise, and potentially transmission errors, is a challenging task. In this work we propose two postprocessing approaches applying convolutional neural networks (CNNs) either in the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs. The time domain approach follows an end-to-end fashion, while the cepstral domain approach uses analysis-synthesis with cepstral domain features. The proposed postprocessors in both domains are evaluated for various narrowband and wideband speech codecs in a wide range of conditions. The proposed postprocessor improves speech quality (PESQ) by up to 0.25 MOS-LQO points for G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for adaptive multirate wideband codec (AMR-WB). In a subjective CCR listening test, the proposed postprocessor on G.711-coded speech exceeds the speech quality of an ITU-T-standardized postfilter by 0.36 CMOS points, and obtains a clear preference of 1.77 CMOS points compared to legacy G.711, even better than uncoded speech with statistical significance. The source code for the cepstral domain approach to enhance G.711-coded speech is made available.

“…Regarding instrumental speech quality assessment, measures such as NB-PESQ [62], WB-PESQ [63], POLQA [64], or QABE [65] cannot be used for the presented LB-ABE approach, since these measures have not been developed for LB-ABE approaches. Still for information, Tab.…”

Section: A Instrumental Evaluation (mentioning)

confidence: 99%

Sinusoidal-Based Lowband Synthesis for Artificial Speech Bandwidth Extension

Abel

2019

Self Cite

Conventional narrowband (NB) telephony suffers from limited acoustic bandwidth at the receiver side, leading to degraded speech quality and intelligibility. In this paper, artificial speech bandwidth extension (ABE) of NB speech toward missing frequencies below about 300 Hz (low-frequency band, LB) is proposed to enhance the speech quality. The LB-ABE in this paper is employed together with a preexisting ABE toward high-frequency components to obtain spectrally balanced speech signals. In an instrumental quality assessment, the spectral distance in the LB was improved by more than 5 dB compared to NB speech. In a subjective listening test, the gap of speech quality between wideband and NB speech was significantly reduced when employing the proposed ABE toward low frequencies. The LB extension was found to further improve the preexisting ABE toward higher frequencies by a significant 0.26 CMOS points.