2022
DOI: 10.1109/taslp.2022.3165442
Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet

Abstract: Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). A DNS trained with mean squared error (MSE) losses cannot guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable and therefore cannot be used directly as an optimization criterion for gradient-based learning. In this work, we propose an end-to-end non-intrusive PESQ…

Cited by 9 publications (11 citation statements)
References 63 publications
“…Both versions of DNSMOS require input speech signals to have a fixed length of nine seconds. In our recent works [21], [22], [24], we proposed an end-to-end PESQNet for DNS applications, adapted from a BLSTM-based speech emotion recognition DNN [35], to predict PESQ scores of the enhanced speech signal. In these works, the trained PESQNet is employed as a mediator providing a differentiable PESQ loss during speech enhancement DNN training, aiming at maximizing the PESQ score of the enhanced speech signal.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
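The mediator idea quoted above can be sketched numerically. The linear "quality predictor" below is a hypothetical stand-in for PESQNet (the real model is a deep network); it only illustrates the core point that a differentiable surrogate lets gradients of the predicted quality score flow back into the enhanced signal, and hence into an enhancement model.

```python
import numpy as np

# Hypothetical minimal sketch: a frozen differentiable "quality predictor"
# q(s) = w.s + b standing in for PESQNet. All names and values here are
# illustrative assumptions, not the paper's actual model.

rng = np.random.default_rng(0)
w = rng.standard_normal(8)  # frozen surrogate weights (assumed)
b = 0.5                     # frozen surrogate bias (assumed)

def predicted_quality(s):
    """Surrogate quality score for an 'enhanced signal' s."""
    return float(w @ s + b)

def quality_grad(s):
    """dq/ds of the surrogate -- differentiable, unlike the original PESQ."""
    return w

# Gradient-ascent steps on the "enhanced signal" to maximize the
# predicted quality, mimicking how a differentiable surrogate loss
# would steer an enhancement DNN's output.
s = np.zeros(8)
lr = 0.1
for _ in range(10):
    s = s + lr * quality_grad(s)

print(predicted_quality(s) > predicted_quality(np.zeros(8)))
```

In the actual training scheme described in the quote, the gradient would flow one step further, through the enhancement DNN's parameters rather than directly into the signal.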
“…As concerns topology, we build upon [22], but many changes are required for the DNN to serve the speech communication monitoring needs targeted in this work: (1) Compared to PESQNet, the novel PESQ-DNN employs a complex spectrogram as input to explicitly consider phase influences on perceived speech quality. Except for a few works, e.g., WaweNet, most speech quality prediction DNNs employ amplitude or power spectrogram inputs, so speech quality degradations caused by phase distortions cannot be measured.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
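The phase-blindness problem described in this quote can be demonstrated directly: corrupting only the phase of a spectrum leaves the magnitude spectrogram unchanged, so an amplitude-based quality predictor sees identical input while the waveform is badly distorted. A small self-contained illustration (the signal and noise here are assumed toy data):

```python
import numpy as np

# Toy demonstration: phase distortion is invisible to magnitude inputs.
rng = np.random.default_rng(1)
x = rng.standard_normal(256)   # assumed toy "speech" signal
X = np.fft.rfft(x)

# Randomize the phase of every bin; magnitudes are untouched.
phase_noise = rng.uniform(-np.pi, np.pi, X.shape)
X_bad = np.abs(X) * np.exp(1j * (np.angle(X) + phase_noise))

# Magnitude spectra agree to floating-point precision...
mag_diff = np.max(np.abs(np.abs(X_bad) - np.abs(X)))
# ...but the resynthesized waveform differs substantially.
wave_diff = np.max(np.abs(np.fft.irfft(X_bad) - x))

print(mag_diff, wave_diff)
```

A quality predictor fed only `np.abs(X)` would score `x` and the phase-corrupted signal identically, which is exactly the limitation the complex-spectrogram input of PESQ-DNN is meant to address.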
“…Machine-learning-based methods have been proposed to eliminate the dependence on clean speech references during inference and can be further divided into two categories. The first attempts to non-intrusively estimate the objective scores mentioned above (Fu et al., 2018; Dong & Williamson, 2020; Zezario et al., 2020; Catellier & Voran, 2020; Yu et al., 2021b; Xu et al., 2022; Kumar et al., 2023). However, during training, noisy/processed and clean speech pairs are still required to obtain the objective scores as model targets.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…However, such objective functions must be carefully designed, as many objective measures contain non-differentiable calculations. Several systems circumvent this limitation by using an additional model that mimics the behaviour of the metric [12]-[14]; this network then serves as a surrogate for the metric in the objective function used to train the speech enhancement model. The baseline system that this work builds upon is one such system, MetricGAN+ [15] (itself an extension of the earlier MetricGAN [16]).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)