Generalization Ability of MOS Prediction Networks

Cooper, Erica; Huang, Wen-Chin; Toda, Tomoki; Yamagishi, Junichi

doi:10.48550/arxiv.2110.02635

Cited by 7 publications

(35 citation statements)

References 16 publications

(23 reference statements)

“…Cooper et al [4] show that adding silence and changing speed as data augmentations can improve MOSNet while seems not helpful for SSL-based MOS prediction models. Although authors do not explain the motivation for choosing these augmentations, we consider that these augmentations do not influence MOS.…”

Section: Data Augmentation (mentioning)

confidence: 99%
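
The two augmentations named in the statement above (adding silence and changing speed) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function names, the use of linear interpolation for the speed change, and all parameter values are assumptions chosen for illustration. Note that this simple resampling approach also shifts pitch. The MOS label is left unchanged, following the citing paper's stated view that these augmentations do not influence MOS.

```python
import numpy as np


def add_silence(wave, sample_rate, lead_sec=0.0, trail_sec=0.0):
    """Pad a mono waveform with leading and/or trailing silence (zeros)."""
    lead = np.zeros(int(round(lead_sec * sample_rate)), dtype=wave.dtype)
    trail = np.zeros(int(round(trail_sec * sample_rate)), dtype=wave.dtype)
    return np.concatenate([lead, wave, trail])


def change_speed(wave, factor):
    """Resample a mono waveform so it plays `factor` times faster.

    Hypothetical helper: uses linear interpolation, so pitch shifts
    together with tempo. factor > 1 shortens the signal, factor < 1
    lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(wave) / factor)))
    # Positions in the original signal at which to sample the output.
    src_positions = np.linspace(0, len(wave) - 1, num=n_out)
    resampled = np.interp(src_positions, np.arange(len(wave)), wave)
    return resampled.astype(wave.dtype)


# Example: augment a one-second 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

padded = add_silence(tone, sample_rate, lead_sec=0.5, trail_sec=0.25)
faster = change_speed(tone, 2.0)
slower = change_speed(tone, 0.5)
```

Each augmented waveform would be paired with the MOS label of the original utterance when added to a training set.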

“…In this paper, two datasets are involved in the experiments: BVCC [4] and BC2019 [17]. BVCC is a newly collected MOS dataset that contains 7106 English samples from previous Blizzard Challenge for TTS [18,19,20,21,22] and Voice Conversion Challenge [23,24,25,26,27] as well as synthesized samples from systems implemented in ESPNet [28].…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…Three baseline models are included in the experiments for comparison: LDNet [3], MOSA-Net [11], SSL-MOS [4]. We use the checkpoints provided by their official implementation repositories.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…LDNet merges the MOS prediction network and the judge-dependent part in MB-Net into a single encoder-decoder network. Cooper et al [4] improves MOSNet by two data augmentations, changing speed and adding silence.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, it is shown to be effective in predicting MOS. Cooper et al [4] study the generalizability of SSL models. The SSL models are fine-tuned on English TTS and VC MOS datasets and transferred to Chinese and Japanese speech.…”

Section: Introduction (mentioning)

confidence: 99%

DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores

Tseng, Kao, Lee

2022, Preprint

Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems. Since collecting MOS is time-consuming, it would be desirable if there are accurate MOS prediction models for automatic evaluation. In this work, we propose DDOS, a novel MOS prediction model. DDOS utilizes domain-adaptive pre-training to further pre-train self-supervised learning models on synthetic speech. And a proposed module is added to model the opinion score distribution of each utterance. With the proposed components, DDOS outperforms previous works on BVCC dataset. And the zero-shot transfer result on BC2019 dataset is significantly improved. DDOS also wins second place in Interspeech 2022 VoiceMOS challenge in terms of system-level score.

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

González-Docasal, Álvarez

2023, Preprint

Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also use two high-quality corpora for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audios for the training of a Voice Cloning system. Following these measurements, we conduct a series of ablations by removing audios with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of synthesised audios for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.

Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction

Martinez, Ragano, Hines

2022, Preprint

Recent studies have shown how self-supervised models can produce accurate speech quality predictions. Speech representations generated by the pre-trained wav2vec 2.0 model allows constructing robust predicting models using small amounts of annotated data. This opens the possibility of developing strong models in scenarios where labelled data is scarce. It is known that fine-tuning improves the model's performance; however, it is unclear how the data (e.g., language, amount of samples) used for fine-tuning is influencing that performance. In this paper, we explore how using different speech corpus to fine-tune the wav2vec 2.0 can influence its performance. We took four speech datasets containing degradations found in common conferencing applications and fine-tuned wav2vec 2.0 targeting different languages and data size scenarios. The fine-tuned models were tested across all four conferencing datasets plus an additional dataset containing synthetic speech and they were compared against three external baseline models. Results showed that fine-tuned models were able to compete with baseline models. Larger fine-tune data guarantee better performance; meanwhile, diversity in language helped the models deal with specific languages. Further research is needed to evaluate other wav2vec 2.0 models pre-trained with multi-lingual datasets and to develop prediction models that are more resilient to language diversity.
