Interspeech 2020
DOI: 10.21437/interspeech.2020-2760
DNN No-Reference PSTN Speech Quality Prediction

Abstract: Classic public switched telephone networks (PSTN) are often a black box for VoIP network providers, as they have no access to performance indicators, such as delay or packet loss. Only the degraded output speech signal can be used to monitor the speech quality of these networks. However, the current state-of-the-art speech quality models are not reliable enough to be used for live monitoring. One of the reasons for this is that PSTN distortions can be unique depending on the provider and country, which makes i…
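To make the no-reference setting concrete, the sketch below shows one way a single-ended DNN predictor can map a degraded recording directly to a MOS estimate, with no clean reference signal involved. It is an illustrative stand-in built on PyTorch and torchaudio, not the architecture proposed in this paper; the layer sizes, the 8 kHz narrowband sample rate, and the mel-spectrogram front end are all assumptions.

# Minimal sketch of a no-reference (single-ended) MOS predictor: a small CNN
# over mel-spectrogram frames, pooled to one clip-level score. Illustrative
# only; not the model described in the paper.
import torch
import torch.nn as nn
import torchaudio

class NoRefMosPredictor(nn.Module):
    def __init__(self, sample_rate=8000, n_mels=48):
        super().__init__()
        # PSTN speech is narrowband, so an 8 kHz sample rate is assumed here.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=512, hop_length=128, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over time and frequency
        )
        self.head = nn.Linear(32, 1)   # regress a single MOS value

    def forward(self, waveform):       # waveform: (batch, samples)
        feats = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (B, 1, mels, frames)
        pooled = self.cnn(feats).flatten(1)                      # (B, 32)
        return self.head(pooled).squeeze(1)                      # (B,) predicted MOS

# Usage: score a 5-second degraded clip; no clean reference is needed.
model = NoRefMosPredictor()
clip = torch.randn(1, 5 * 8000)        # placeholder for a real recording
print(model(clip))                     # one MOS-like score per clip

Training such a model only requires pairs of degraded clips and subjective MOS labels, which matches the PSTN monitoring scenario where the clean input signal is unavailable.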

Cited by 15 publications (7 citation statements)
References 31 publications
“…training set file was rated by 5 participants, while the test set files were rated by 30 participants to ensure a low confidence interval of the MOS values for the model evaluation. For more details, please refer to [30].…”
Section: PSTN Corpus
confidence: 99%
“…Such datasets include more than 200 hours of speech samples degraded with common degradations experienced over conferencing applications plus speech synthesis and voice conversion samples. This study considered the synthesised speech dataset VoiceMOS [7], plus four sets of datasets with speech conferencing distortions, namely: Tencent (2 datasets), NISQA (7 datasets) [8], IU-Bloomington (2 datasets) [13,14] and PSTN [15]. These datasets were used to build subsets to fine-tune the wav2vec 2.0 pre-trained model targeting specific speech scenarios.…”
Section: Datasets
confidence: 99%
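As a rough illustration of the fine-tuning pattern described in the statement above, the sketch below attaches a small regression head to a pretrained wav2vec 2.0 encoder and trains both jointly on MOS-labelled clips via the Hugging Face transformers library. The checkpoint name, pooling choice, and head design are assumptions for illustration, not details taken from the cited work or from [15].

# Hedged sketch: fine-tuning a pretrained wav2vec 2.0 encoder for MOS
# regression. Checkpoint and head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class Wav2Vec2MosRegressor(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values):   # raw 16 kHz waveform, (B, samples)
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        pooled = hidden.mean(dim=1)                            # mean-pool over time
        return self.head(pooled).squeeze(1)                    # (B,) predicted MOS

model = Wav2Vec2MosRegressor()
mos_pred = model(torch.randn(2, 16000))          # two one-second placeholder clips
loss = nn.functional.mse_loss(mos_pred, torch.tensor([3.2, 4.1]))
loss.backward()                                  # updates encoder and head jointly

Mean-pooling the frame-level representations keeps the added head tiny, so most of the capacity being adapted during fine-tuning comes from the pretrained encoder itself.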
“…The experiment seeks to understand the level of influence that fine-tuning data size has on speech quality predictions. Labelled speech datasets containing conferencing degradations (Tencent, IU Bloomington [13,14], NISQA Corpus [8], and PSTN [15]) and synthesised speech (VoiceMOS [7]) were used to build the fine-tuning datasets and test the resulting models.…”
Section: Introduction
confidence: 99%
“…As machine learning (ML) has become more powerful and accessible, numerous research groups have sought to apply ML to develop NR tools [17]-[50]. Some of these NR tools produce estimates of subjective test scores that report speech or sound quality mean opinion score (MOS) [17]-[19], [24]-[27], [30], [35], [38], [40], [41], [46], [47], naturalness [28], [34], [36], listening effort [23], noise intrusiveness [47], and speech intelligibility [20], [32]. The non-intrusive speech quality assessment model called NISQA [50] produces estimates of subjective speech quality as well as four constituent dimensions: noisiness, coloration, discontinuity, and loudness.…”
Section: A. Existing Machine Learning Approaches
confidence: 99%