ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053951
Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks

Abstract: In this paper, we present a full-reference speech quality prediction model with a deep learning approach. The model determines a feature representation of the reference and the degraded signal through a siamese recurrent convolutional network that shares the weights for both signals as input. The resulting features are then used to align the signals with an attention mechanism and are finally combined to estimate the overall speech quality. The proposed network architecture represents a simple solution for the…
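The attention-based alignment step described in the abstract can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names, feature dimensions, and the use of plain dot-product attention are assumptions. Each degraded frame attends over all reference frames, producing a reference sequence re-aligned to the degraded signal's time axis.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_align(ref, deg):
    """Align reference frames to each degraded frame via dot-product attention.

    ref: (T_ref, D) reference feature sequence
    deg: (T_deg, D) degraded feature sequence
    Returns (T_deg, D): reference features weighted to match each degraded frame.
    """
    # Similarity of every degraded frame to every reference frame
    scores = deg @ ref.T / np.sqrt(ref.shape[1])   # (T_deg, T_ref)
    weights = softmax(scores, axis=1)              # rows sum to 1
    return weights @ ref                           # attention-weighted reference

rng = np.random.default_rng(0)
ref = rng.standard_normal((50, 16))
deg = ref[5:45]  # degraded signal modeled as a time-shifted excerpt
aligned = attention_align(ref, deg)
```

Because the aligned output has the degraded signal's length, the two feature sequences can be combined frame-by-frame downstream, which is what makes the subsequent quality estimation straightforward.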

Cited by 12 publications (7 citation statements)
References 14 publications (14 reference statements)
“…In a next step, the pretraining database could be improved further with an even wider variety of different speakers and conditions that are more similar to TTS distortions. Also, the full-reference speech quality model presented in [19], which automatically aligns the reference to the degraded signal, could be used to estimate the similarity between original and synthesized/voice-conversed speakers.…”
Section: Discussion
confidence: 99%
“…We use the pretraining database from [19] to first build a speech quality prediction network that is trained on speech communication network degradation. We then use the speech quality prediction domain knowledge of the neural network (github.com/gabrielmittag/NISQA) to improve the reliability of synthesized speech naturalness prediction through transfer learning.…”
Section: Speech Quality Pretraining
confidence: 99%
“…CNN-LSTM is a neural network structure that combines a CNN and an LSTM; it has recently been used for speech quality assessment [2,7,25]. In this structure, CNNs extract deep features of the speech, and the CNN feature vectors are then used as input to an LSTM network that models time dependencies, so CNN-LSTM combines the advantages of both CNN and LSTM.…”
Section: CNN-LSTM
confidence: 99%
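The CNN-then-LSTM pipeline quoted above can be sketched in miniature. This is a hedged NumPy toy, not any paper's architecture: the kernel shapes, a plain tanh recurrence standing in for a real LSTM cell, and all dimensions here are invented for illustration. A convolution over spectrogram frames produces a feature sequence, which a recurrent layer then summarizes over time.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_features(x, kernels):
    """1-D convolution over time. x: (T, D) frames; kernels: (K, w, D) -> (T-w+1, K)."""
    K, w, D = kernels.shape
    T = x.shape[0]
    out = np.empty((T - w + 1, K))
    for t in range(T - w + 1):
        patch = x[t:t + w]  # (w, D) window of frames
        # Correlate each kernel with the window
        out[t] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU

def simple_rnn(feats, Wx, Wh):
    """Toy tanh recurrence over the CNN feature sequence; returns last hidden state.

    A real CNN-LSTM would use an LSTM cell here; the time-dependency modeling
    role is the same.
    """
    h = np.zeros(Wh.shape[0])
    for f in feats:
        h = np.tanh(Wx @ f + Wh @ h)
    return h

T, D, K, w, H = 100, 32, 8, 5, 16   # frames, mel bands, kernels, width, hidden size
x = rng.standard_normal((T, D))     # stand-in mel-spectrogram
kernels = rng.standard_normal((K, w, D)) * 0.1
Wx = rng.standard_normal((H, K)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

feats = conv1d_features(x, kernels)  # (96, 8) deep features
h_last = simple_rnn(feats, Wx, Wh)   # (16,) time-summarized representation
```

The final hidden state (or the full hidden sequence) would then feed a small regression head that outputs the quality score.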
“…The speech quality prediction model used in this paper is a narrowband (up to 4 kHz) version of the CNN-LSTM neural network presented in [11,25]. Instead of using mel-spectrograms with a maximum frequency of 16 kHz and 48 mel bands, the mel-spectrogram inputs to the proposed PSTN model have a maximum frequency of 4 kHz and 32 mel bands.…”
Section: Model Description
confidence: 99%
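The narrowband configuration quoted above (32 mel bands up to 4 kHz) can be made concrete by computing the mel filterbank band edges. This sketch uses the standard HTK-style mel formula; the function names are illustrative, and real toolkits (e.g. librosa) handle the filterbank construction internally.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Edge frequencies (Hz) of a triangular mel filterbank.

    n_mels triangular filters need n_mels + 2 edge points,
    spaced uniformly on the mel scale between f_min and f_max.
    """
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    return mel_to_hz(mels)

# Narrowband configuration from the quote: 32 bands, 0 Hz to 4 kHz
edges_nb = mel_band_edges(32, 0.0, 4000.0)
```

Halving the maximum frequency and reducing the band count simply shrinks the input feature dimension; the rest of the CNN-LSTM architecture can stay unchanged, which is what makes the narrowband adaptation cheap.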