Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Tamm, Bastiaan; Balabin, Helena; Vandenberghe, Rik; Van hamme, Hugo

doi:10.21437/interspeech.2022-10147

Cited by 6 publications

(7 citation statements)

References 17 publications

(22 reference statements)

“…Another recent challenge [24] focused on predicting speech quality in speech conferencing applications, and also saw several submissions, e.g. [25,26], making use of SSL representations. This task does involve spontaneous speech audio, but focuses only on assessing quality of speech transmission in online conferencing and not on assessing synthesized spontaneous speech from a TTS model.…”

Section: Quality Prediction Using SSL Models (mentioning)

confidence: 99%

On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

Wang, Henter, Gustafson et al. 2023

12th ISCA Speech Synthesis Workshop (SSW2023)

Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL and which layer from each SSL model is most suited for spontaneous TTS. We address this shortcoming by extending the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL. Furthermore, SSL has also shown potential in predicting the mean opinion scores (MOS) of synthesized speech, but this has only been done in read-speech MOS prediction. We extend an SSL-based MOS prediction framework previously developed for scoring read speech synthesis and evaluate its performance on synthesized spontaneous speech. All experiments are conducted twice on two different spontaneous corpora in order to find generalizable trends. Overall, we present comprehensive experimental results on the use of SSL in spontaneous TTS and MOS prediction to further quantify and understand how SSL can be used in spontaneous TTS. Audio samples: https://www.speech.kth.se/tts-demos/sp_ssl_tts

“…SSSRs have been applied to metric prediction tasks, typically to quality prediction [22,23]. In [13], XLSR representations are used as feature extraction in a non-intrusive human MOS prediction network. Similarly, in [24] SSSRs are used for the same quality prediction task, but they are fine-tuned with a mean pooling layer rather than being used simply as feature extraction.…”

Section: SSSRs for Metric Prediction (mentioning)

confidence: 99%
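The mean pooling mentioned in the citation statement above can be illustrated with a minimal sketch. It assumes frame-level features arrive as plain lists of floats and that a single linear layer maps the pooled vector to a score; the names `mean_pool` and `linear_head`, the toy features, and the weights are all hypothetical and are not taken from any of the cited papers.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level feature vectors over time into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def linear_head(pooled: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical regression head: weighted sum of the pooled features plus a bias."""
    if len(pooled) != len(weights):
        raise ValueError("weights must match the pooled feature dimension")
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Toy example: three frames of two-dimensional features (made-up values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(frames)
score = linear_head(pooled, weights=[0.5, 0.25], bias=1.0)
```

In a real system the frames would come from a pre-trained model such as XLSR or HuBERT and the head would be trained against human MOS labels; the sketch only shows how a variable-length sequence collapses to a fixed-size input.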

“…A model structure inspired by [13] is chosen for the SI prediction network. Five feature extraction methods are used: outputs of G_FE and G_OL for both HuBERT and XLSR representations, as well as a spectrogram representation denoted as SPEC.…”

Section: Model Structure and Experiments Setup (mentioning)

confidence: 99%

“…It is understood that SSSRs are able to encode and predict the context of the speech content in the input audio, and thus model the patterns of spoken language. Recent work [12][13][14][15][16] has found that in addition to speech content, SSSRs are also able to capture information on potentially corrupting noise and distortion in the input audio. In this work, SSSRs are used as a feature transformation for non-intrusive neural speech intelligibility prediction networks, trained on the CPC1 challenge dataset.…”

Section: Introduction (mentioning)

confidence: 99%

Non-intrusive Speech Intelligibility Metric Prediction for Hearing Impaired Individuals

Close, Hollands, Goetze et al. 2022

Interspeech 2022

This paper proposes neural models to predict Speech Intelligibility (SI), both by prediction of established SI metrics and of human speech recognition (HSR) on the 1st Clarity Prediction Challenge. Both intrusive and non-intrusive predictors for intrusive SI metrics are trained, then fine-tuned on the HSR ground truth. Results are reported on a number of SI metrics, and the model choice for the Clarity challenge submission is explained. Additionally, the relationship between the SI scores in the data and commonly used signal processing metrics which approximate SI is analysed, and some issues emerging from this relationship are discussed. It is found that intrusive neural predictors of SI metrics, when fine-tuned on the true HSR scores, outperform the non-neural challenge baseline.

“…It is demonstrated that the encoder layers have a notably stronger correlation to the aforementioned evaluation measures than the output layers. Hidden unit BERT (HuBERT) [12] and XLSR [14] SSSRs are chosen as these have both been applied in related speech tasks previously but in different ways [15,16]. Following from this, the distances between clean and noisy SSSR features are then evaluated for their usefulness as loss functions to train speech enhancement models.…”

Section: Introduction (mentioning)

confidence: 99%

Perceive and Predict: Self-Supervised Speech Representation Based Loss Functions for Speech Enhancement

Ravenscroft, Hain et al. 2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models, rather than the earlier feature encodings. The use of self-supervised representations in such a way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance over the use of STFT spectrogram distance based loss as well as other common loss functions from speech enhancement literature is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
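The idea of using a distance between clean and noisy feature encodings as a loss can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which distance is used, so mean squared error is assumed here, the function name `feature_distance` is hypothetical, and the feature values are made up rather than produced by any real encoder.

```python
from typing import Sequence


def feature_distance(clean: Sequence[Sequence[float]],
                     noisy: Sequence[Sequence[float]]) -> float:
    """Mean squared error between two time-aligned feature sequences.

    Assumption: both inputs have the same number of frames and the same
    feature dimension per frame (e.g. encodings of paired clean/noisy audio).
    """
    if len(clean) != len(noisy) or not clean:
        raise ValueError("inputs must be non-empty and have the same number of frames")
    total = 0.0
    count = 0
    for c_frame, n_frame in zip(clean, noisy):
        if len(c_frame) != len(n_frame):
            raise ValueError("frames must have the same dimension")
        for a, b in zip(c_frame, n_frame):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one feature")
    return total / count


# Toy example: identical features give zero loss; a perturbed frame increases it.
clean_feats = [[0.0, 1.0], [2.0, 3.0]]
noisy_feats = [[0.0, 1.0], [2.0, 5.0]]
loss_same = feature_distance(clean_feats, clean_feats)
loss_diff = feature_distance(clean_feats, noisy_feats)
```

When training an enhancement model, the second argument would be the encoding of the enhanced output rather than of the raw noisy input, and the computation would live in a framework with automatic differentiation.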
