Interspeech 2022
DOI: 10.21437/interspeech.2022-10147

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Abstract: Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R m…

Cited by 6 publications (7 citation statements)
References 17 publications
“…Another recent challenge [24] focused on predicting speech quality in speech conferencing applications, and also saw several submissions, e.g. [25,26], making use of SSL representations. This task does involve spontaneous speech audio, but focuses only on assessing quality of speech transmission in online conferencing and not on assessing synthesized spontaneous speech from a TTS model.…”
Section: Quality Prediction Using SSL Models (mentioning)
confidence: 99%
“…SSSRs have been applied to metric prediction tasks, typically to quality prediction [22,23]. In [13], XLSR representations are used as feature extraction in a non-intrusive human MOS prediction network. Similarly, in [24] SSSRs are used for the same quality prediction task, but they are fine-tuned with a mean pooling layer rather than being used simply as feature extraction.…”
Section: SSSRs for Metric Prediction (mentioning)
confidence: 99%
“…A model structure inspired by [13] is chosen for the SI prediction network. Five feature extraction methods are used; outputs of G_FE and G_OL for both HuBERT and XLSR representations, as well as a spectrogram representation denoted as SPEC.…”
Section: Model Structure and Experiments Setup (mentioning)
confidence: 99%
“…It is demonstrated the encoder layers have a notably stronger correlation to the aforementioned evaluation measures than the output layers. Hidden unit BERT (HuBERT) [12] and XLSR [14] SSSRs are chosen as these have both been applied in related speech tasks previously but in different ways [15,16]. Following from this, the distances between clean and noisy SSSR features are then evaluated for their usefulness as loss functions to train speech enhancement models.…”
Section: Introduction (mentioning)
confidence: 99%