A Classification-Aided Framework for Non-Intrusive Speech Quality Assessment

Dong, Xuan; Williamson, Donald S.

doi:10.1109/waspaa.2019.8937192

Cited by 14 publications

(9 citation statements)

References 28 publications


“…It was also noticed in [22] that by minimizing the MSE the regression task may lead to prediction outliers and cause the model to overfit. By doing a very coarse quantization on real value PESQ labels, a classification loss was added to the MSE based regression loss of PESQ for punishing samples with large estimation errors.…”

Section: Recap Of Related Methods (mentioning)

confidence: 99%
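The idea in the statement above, adding a classification loss on coarsely quantized PESQ labels to an MSE regression loss, can be sketched as follows. This is a minimal illustration only, not the cited authors' implementation: the bin edges in `BIN_EDGES`, the weight `alpha`, and all function names are assumptions introduced here.

```python
import math

# Hypothetical bin edges for a coarse quantization of PESQ scores.
# Three edges give four classes; the actual bins used in the paper are not stated here.
BIN_EDGES = [1.5, 2.5, 3.5]


def quantize(pesq):
    """Map a real-valued PESQ score to a coarse class index (0..len(BIN_EDGES))."""
    return sum(pesq >= edge for edge in BIN_EDGES)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def composite_loss(pred_score, class_logits, true_pesq, alpha=1.0):
    """Squared-error regression term plus alpha-weighted cross-entropy
    on the quantized label (alpha is an assumed weighting)."""
    mse = (pred_score - true_pesq) ** 2
    target = quantize(true_pesq)
    ce = -math.log(softmax(class_logits)[target])
    return mse + alpha * ce


# A prediction whose class logits favour the wrong bin is penalized more
# than one with the same regression error but the right bin.
right = composite_loss(2.9, [0.0, 0.0, 4.0, 0.0], true_pesq=3.0)
wrong = composite_loss(2.9, [4.0, 0.0, 0.0, 0.0], true_pesq=3.0)
```

In this sketch the cross-entropy term grows when the predicted class is far from the quantized label, which is one way a classification term could penalize samples with large estimation errors.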

“…Meanwhile, such metric models are not differentiable, which limits their capability to work with other systems in a joint optimization manner. Inspired by the great success of deep learning, the deep neural networks have been developed to address the non-intrusive speech evaluation problem recently [5, 15-23]. In [17], an end-to-end and non-intrusive speech quality evaluation model, termed Quality-Net, was proposed based on bidirectional long short-term memory (BLSTM).…”

Section: Introduction (mentioning)

confidence: 99%


MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment

Zhang et al. 2021

Preprint


The objective speech quality assessment is usually conducted by comparing received speech signal with its clean reference, while human beings are capable of evaluating the speech quality without any reference, such as in the mean opinion score (MOS) tests. Non-intrusive speech quality assessment has attracted much attention recently due to the lack of access to clean reference signals for objective evaluations in real scenarios. In this paper, we propose a novel non-intrusive speech quality measurement model, MetricNet, which leverages label distribution learning and joint speech reconstruction learning to achieve significantly improved performance compared to the existing non-intrusive speech quality measurement models. We demonstrate that the proposed approach yields promisingly high correlation to the intrusive objective evaluation of speech quality on clean, noisy and processed speech data.


“…The kernel sizes of the maxpooling layers are set to 2 × 2. Note that this architecture is based on [16], since it performed well for a similar but different speech assessment task. We did, however, make modifications as discussed next.…”

Section: Feature Extraction and T60 Estimation (mentioning)

confidence: 99%

“…Since other speech-related tasks have shown that the MSE is sub-optimal, alternative loss functions for reverberation time estimation should be explored. Likewise, work in ASR [15] and speech assessment [16] have shown that treating speech-tasks as classification, rather than regression problems is beneficial. In this paper, we propose a composite classification and regression based loss function for estimating reverberation time for a variety of seen and unseen reverberant conditions.…”

Section: Introduction (mentioning)

confidence: 99%

On Loss Functions for Deep-Learning Based T60 Estimation

Liu, Williamson 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Reverberation time, T60, directly influences the amount of reverberation in a signal, and its direct estimation may help with dereverberation. Traditionally, T60 estimation has been done using signal processing or probabilistic approaches, until recently where deep-learning approaches have been developed. Unfortunately, the appropriate loss function for training the network has not been adequately determined. In this paper, we propose a composite classification- and regression-based cost function for training a deep neural network that predicts T60 for a variety of reverberant signals. We investigate pure-classification, pure-regression, and combined classification-regression based loss functions, where we additionally incorporate computational measures of success. Our results reveal that our composite loss function leads to the best performance as compared to other loss functions and comparison approaches. We also show that this combined loss function helps with generalization.


“…In most cases, the primary approach is to train a model to predict objective (e.g., PESQ) and/or subjective (e.g., MOS) scores. However, generalization to unseen perturbations and tasks remains a concern [21], and most methods have not found wide-spread uses for SQA. Given that matching human subjective ratings is the ultimate motivation, some recent works try to train neural networks directly on MOS scores [20,14,1].…”

Section: Introduction (mentioning)

confidence: 99%

NORESQA: A Framework for Speech Quality Assessment using Non-Matching References

Manocha, Xu, Kumar 2021

Preprint


The perceptual task of speech quality assessment (SQA) is a challenging task for machines to do. Objective SQA methods that rely on the availability of the corresponding clean reference have been the primary go-to approaches for SQA. Clearly, these methods fail in real-world scenarios where the ground truth clean references are not available. In recent years, non-intrusive methods that train neural networks to predict ratings or scores have attracted much attention, but they suffer from several shortcomings such as lack of robustness, reliance on labeled data for training and so on. In this work, we propose a new direction for speech quality assessment. Inspired by human's innate ability to compare and assess the quality of speech signals even when they have non-matching contents, we propose a novel framework that predicts a subjective relative quality score for the given speech signal with respect to any provided reference without using any subjective data. We show that neural networks trained using our framework produce scores that correlate well with subjective mean opinion scores (MOS) and are also competitive to methods such as DNSMOS [1], which explicitly relies on MOS from humans for training networks. Moreover, our method also provides a natural way to embed quality-related information in neural networks, which we show is helpful for downstream tasks such as speech enhancement.
