2021
DOI: 10.1109/access.2021.3112165

Marginal Effects of Language and Individual Raters on Speech Quality Models

Abstract: Speech quality is often measured via subjective testing, or with objective estimators of mean opinion score (MOS) such as ViSQOL or POLQA. Typical MOS-estimation frameworks use signal-level features but do not use language features that have been shown to have an effect on opinion scores. If there is a conditional dependence between score and language given these signal features, introducing language and rater predictors should provide a marginal improvement in predictions. The proposed method uses Bayesian mo…
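The abstract's reasoning (if opinion score depends on language and rater given the signal features, then language and rater predictors carry marginal predictive value) can be illustrated with a hierarchical regression. Below is a minimal sketch in PyMC with partially pooled language and rater offsets added on top of a single signal feature. The synthetic data, feature names, priors, and dimensions are illustrative assumptions, not the paper's model or code.

```python
# Sketch: a signal-feature MOS regressor augmented with hierarchical
# language and per-rater effects. All names and priors are assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# Toy data: one signal-level feature per clip (e.g., an objective score
# such as ViSQOL), plus language and rater indices for each rating.
n_obs, n_langs, n_raters = 500, 4, 50
visqol = rng.normal(3.5, 0.7, n_obs)      # stand-in signal feature
lang = rng.integers(0, n_langs, n_obs)    # language of each clip
rater = rng.integers(0, n_raters, n_obs)  # listener who gave the rating
mos = rng.normal(visqol, 0.5)             # placeholder opinion scores

with pm.Model() as model:
    # Signal-feature regression: the "typical" MOS-estimation part.
    intercept = pm.Normal("intercept", mu=0.0, sigma=2.0)
    slope = pm.Normal("slope", mu=1.0, sigma=1.0)

    # Marginal predictors: partially pooled language and rater offsets.
    sigma_lang = pm.HalfNormal("sigma_lang", sigma=0.5)
    sigma_rater = pm.HalfNormal("sigma_rater", sigma=0.5)
    lang_offset = pm.Normal("lang_offset", mu=0.0,
                            sigma=sigma_lang, shape=n_langs)
    rater_offset = pm.Normal("rater_offset", mu=0.0,
                             sigma=sigma_rater, shape=n_raters)

    # Predicted MOS combines the signal model with both offset terms.
    mu = intercept + slope * visqol + lang_offset[lang] + rater_offset[rater]
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=mos)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```

Comparing this model's held-out predictive accuracy against a signal-only baseline (the same model with the two offset terms removed) mirrors the marginal-improvement question the abstract poses.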

Cited by 4 publications (3 citation statements)
References 19 publications
“…The tool extends possible ways to validate data consistency of responses obtained during a MQA subjective experiment [8]. Importantly, our work was noticed by practitioners in the MQA field and referred to in [9], [10], and [11].…”
Section: Introduction (mentioning; confidence: 75%)
“…In parallel to the deep learning improvements that are mostly driven by extracting more useful information from the waveform, researchers have made progress in obtaining a better understanding of the biases and factors of listening tests that are independent of the speech signal being rated [15]. For example, research has found that a significant amount of bias may be attributable to properties of the listeners, including their language and culture, as well as their individual tendencies to rate high or low [16,17]. LDNet [18], a baseline model for this challenge, considers rater metadata.…”
Section: Introduction (mentioning; confidence: 99%)
“…Despite their promising results, data-driven models are exposed to bias depending on the type of data used to train them. Collecting data that does not bias the model is a challenge (specifically neutral to: speaker voice/accent, language, degradation) [9,10,11].…”
Section: Introduction (mentioning; confidence: 99%)