Machine learning based non-intrusive quality estimation with an augmented feature set

Hakami, Mona; Kleijn, W. Bastiaan

doi:10.1109/icassp.2017.7953129

Cited by 12 publications

(65 citation statements)

References 13 publications


“…Haemin et al. [4] proposed a deep neural network (DNN) based non-intrusive speech quality estimation method for real-time voice communication systems. Hakami and Kleijn [5] used an augmented feature set and a neural network to improve the prediction accuracy of the single-ended quality assessment approach. Quality-Net [6], based on bidirectional long short-term memory (BLSTM), combined the frame-level scores into the final estimated utterance-level quality score using average pooling.…”

Section: Related Work (mentioning)

confidence: 99%

“…It was proposed relatively early and its accuracy is far from intrusive methods. With the rapid development of deep learning technology, many researchers have applied deep neural networks to speech quality assessment [4][5][6][7], which greatly improved the accuracy of non-intrusive methods. But none of them paid attention to the pooling function before the output of neural networks in speech quality assessment task.…”

Section: Introduction (mentioning)

confidence: 99%


Neural network-based non-intrusive speech quality assessment using attention pooling function

Liu, Wang, et al. 2021. J AUDIO SPEECH MUSIC PROC.


Recently, the non-intrusive speech quality assessment method has attracted a lot of attention since it does not require the original reference signals. At the same time, neural networks began to be applied to speech quality assessment and achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using an attention pooling function. The proposed systems are based on convolutional neural networks (CNNs), bidirectional long short-term memory (BLSTM), and a CNN-LSTM structure. Comparing four types of pooling functions both theoretically and experimentally, we find the attention pooling function performs the best among the four. Experiments are conducted on a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using the attention pooling function achieves a state-of-the-art correlation coefficient (R) and root-mean-square error (RMSE) of 0.967 and 0.269, outperforming the ITU-T P.563 standard and the autoencoder-support vector regression method.
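To make the idea of pooling frame-level scores into one utterance-level score concrete, the following is a minimal sketch that contrasts average pooling with a softmax-weighted attention pooling. The function names, the caller-supplied attention logits, and the example values are all hypothetical; the abstract does not give the paper's exact attention formulation, so this should not be read as that method.

```python
import numpy as np

def average_pooling(frame_scores):
    """Utterance score as the plain mean of the frame-level scores."""
    return float(np.mean(frame_scores))

def attention_pooling(frame_scores, attention_logits):
    """Utterance score as a softmax-weighted mean of the frame-level scores.

    In a trained model the logits would come from a learned layer; here
    they are simply passed in (hypothetical interface for illustration).
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    logits = np.asarray(attention_logits, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(np.dot(weights, frame_scores))

# Toy example: one badly degraded frame among otherwise good frames.
frame_scores = np.array([4.0, 4.0, 1.0, 4.0])
uniform_logits = np.zeros(4)                      # equal attention everywhere
focused_logits = np.array([0.0, 0.0, 3.0, 0.0])   # attention on the bad frame

avg = average_pooling(frame_scores)
att_uniform = attention_pooling(frame_scores, uniform_logits)
att_focused = attention_pooling(frame_scores, focused_logits)
```

With uniform logits the attention pooling reduces to the average; with logits that emphasise the degraded frame the pooled score moves toward that frame's score.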


“…The underlying random features in $X$ are independent. Without loss of generality, we assume $R_X = I_{d \times d}$ (so it sets the scale), $R_U = h I_{d \times d}$, and $R_W = g I_{t \times t}$, where $d$ and $t$ are the dimensionality of $X$ and $Y$ respectively, and $g$ and $h$ are small [1]. These assumptions led to…”

Section: Model Behaviour For Redundant Features (mentioning)

confidence: 99%

“…As proposed in the previous section, using a large number of features is beneficial for better performance. The usage of a large number of features naturally leads to the inclusion of features that have poor behaviour [1].…”

Section: Pre-processing Features (mentioning)

confidence: 99%

“…As discussed in Section 2.2, the non-intrusive quality estimation P.563 and ANIQUE+ are the two existing standards and naturally form an excellent reference for our work. Therefore, we [1] built our input vector so that it contains both the features extracted from P.563 and ANIQUE+. Since P.563 and ANIQUE+ are designed for narrowband speech, our system needs to downsample the speech files to 8 kHz if they are wideband.…”

Section: Enhanced Feature Set For Quality Estimation (mentioning)

confidence: 99%
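The downsampling step mentioned in the statement above can be sketched as follows. This is only an illustration under stated assumptions: the function name `downsample_to_8k`, the windowed-sinc filter design, and the tap count are hypothetical choices, not the cited system's implementation, and only integer rate ratios (for example 16 kHz to 8 kHz) are handled.

```python
import numpy as np

def downsample_to_8k(signal, sample_rate, num_taps=101):
    """Low-pass filter and decimate a signal to 8 kHz (integer ratios only)."""
    target_rate = 8000
    signal = np.asarray(signal, dtype=float)
    if sample_rate == target_rate:
        return signal
    if sample_rate % target_rate != 0:
        raise ValueError("this sketch only supports integer multiples of 8 kHz")
    factor = sample_rate // target_rate
    # Windowed-sinc low-pass filter with its cutoff at the new Nyquist
    # frequency (4 kHz); the Hamming window tapers the ideal response.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    taps = np.sinc(n / factor) / factor * np.hamming(num_taps)
    taps /= taps.sum()  # normalise to unity gain at DC
    # Filter first to suppress content that would alias, then keep
    # every `factor`-th sample.
    filtered = np.convolve(signal, taps, mode="same")
    return filtered[::factor]

# Toy wideband input: a 1 kHz tone (kept) plus a 6 kHz tone (removed,
# since it lies above the 4 kHz Nyquist limit of the 8 kHz output).
sr = 16000
t = np.arange(sr) / sr
wideband = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)
narrowband = downsample_to_8k(wideband, sr)
```

Filtering before decimation matters here: without it, the 6 kHz component would fold back to 2 kHz in the 8 kHz signal.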


Machine Learning for Non-Intrusive Speech Quality Assessment

Hakami


This thesis presents two studies on non-intrusive speech quality assessment methods. The first applies supervised learning methods to speech quality assessment, which is a common approach in machine learning based quality assessment. To outperform existing methods, we concentrate on enhancing the feature set. In the second study, we analyse quality assessment from a different point of view inspired by the biological brain and present the first unsupervised learning based non-intrusive quality assessment that removes the need for labelled training data. Supervised learning based, non-intrusive quality predictors generally involve the development of a regressor that maps signal features to a representation of perceived quality. The performance of the predictor largely depends on 1) how sensitive the features are to the different types of distortion, and 2) how well the model learns the relation between the features and the quality score. We improve the performance of the quality estimation by enhancing the feature set and using a contemporary machine learning model that fits this objective. We propose an augmented feature set that includes raw features that are presumably redundant. The speech quality assessment system benefits from this redundancy as it results in reducing the impact of unwanted noise in the input. Feature set augmentation generally leads to the inclusion of features that have non-smooth distributions. We introduce a new pre-processing method and re-distribute the features to facilitate the training. The evaluation of the system on the ITU-T Supplement 23 database illustrates that the proposed system outperforms the popular standards and contemporary methods in the literature. The unsupervised learning quality assessment approach presented in this thesis is based on a model that is learnt from clean speech signals. Consequently, it does not need to learn the statistics of any corruption that exists in the degraded speech signals and is trained only with unlabelled clean speech samples. The quality has a new definition, which is based on the divergence between 1) the distribution of the spectrograms of test signals, and 2) the pre-existing model that represents the distribution of the spectrograms of good quality speech. The distribution of the spectrogram of the speech is complex, and hence comparing them is not trivial. To tackle this problem, we propose to map the spectrograms of speech signals to a simple latent space. Generative models that map simple latent distributions into complex distributions are excellent platforms for our work. Generative models that are trained on the spectrograms of clean speech signals learn to map the latent variable $Z$ from a simple distribution $P_Z$ into a spectrogram $X$ from the distribution of good quality speech. Consequently, an inference model is developed by inverting the pre-trained generator, which maps spectrograms of the signal under test, $X_t$, into its relevant latent variable, $Z_t$, in the latent space. We postulate the divergence between the distribution of the latent variable and the prior distribution $P_Z$ is a good measure of the quality of speech. Generative adversarial nets (GAN) are an effective training method and work well in this application. The proposed system is a novel application for a GAN. The experimental results with the TIMIT and NOIZEUS databases show that the proposed measure correlates positively with the objective quality scores.
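The idea of scoring quality by the divergence between inferred latent variables $Z_t$ and the prior $P_Z$ can be illustrated with a small sketch. Several things here are assumptions rather than details from the abstract: the prior is taken to be a standard normal, the latent samples are summarised by a diagonal Gaussian fit, the divergence is the Kullback-Leibler divergence, and the latent samples are simulated instead of being obtained by inverting a trained generator.

```python
import numpy as np

def kl_to_standard_normal(latents):
    """KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian fitted to latents.

    latents: array of shape (num_samples, latent_dim).
    Assumes a standard normal prior P_Z (an assumption of this sketch).
    """
    latents = np.asarray(latents, dtype=float)
    mu = latents.mean(axis=0)
    var = latents.var(axis=0) + 1e-12  # small floor to avoid log(0)
    # Closed form per dimension: 0.5 * (var + mu^2 - 1 - ln var).
    return float(0.5 * np.sum(var + mu**2 - 1.0 - np.log(var)))

# Simulated stand-ins for inferred latent variables (hypothetical data):
# latents of clean speech should follow the prior, degraded ones drift away.
rng = np.random.default_rng(0)
clean_like = rng.standard_normal((5000, 8))
degraded_like = 0.5 + 1.5 * rng.standard_normal((5000, 8))

kl_clean = kl_to_standard_normal(clean_like)
kl_degraded = kl_to_standard_normal(degraded_like)
```

Under these assumptions, a larger divergence from the prior would be read as lower quality; the simulated degraded latents yield a clearly larger value than the clean-like ones.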


Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Voran. 2023. IEEE Access


Speech quality and speech intelligibility can vary dramatically across the wide range of currently available telecommunications systems, devices, and operating environments. This creates a strong demand for efficient real-time measurements of quality and intelligibility. Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require "reference" (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values with per-segment correlations in the range of 0.92 to 0.96. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves a per-segment correlation of 0.97. The performance of our WAWEnet architecture compares favorably to models with orders-of-magnitude more parameters and computational complexity. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
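The signal-processing remark about ReLUs moving spectral information into the DC component can be checked on a toy signal. The sketch below is not a WAWEnet; it only shows, for a single hypothetical 440 Hz tone, that a zero-mean sinusoid has essentially no DC component before rectification and a DC component of roughly amplitude divided by pi afterwards.

```python
import numpy as np

def dc_component(x):
    """DC component of a signal, i.e. its mean value."""
    return float(np.mean(x))

# Unit-amplitude 440 Hz tone at 16 kHz, one second long, so the signal
# contains a whole number of periods and its mean is zero.
sample_rate = 16000
n = np.arange(sample_rate)
tone = np.sin(2 * np.pi * 440 * n / sample_rate)

# ReLU keeps the positive half-waves and zeroes the negative ones.
rectified = np.maximum(tone, 0.0)

dc_before = dc_component(tone)       # essentially zero
dc_after = dc_component(rectified)   # close to 1 / pi for unit amplitude
```

The rectified tone's mean depends on the tone's amplitude, which is one simple way to see how energy at a non-DC frequency can end up encoded in a DC value.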
