Interspeech 2022
DOI: 10.21437/interspeech.2022-10022
A Transfer and Multi-Task Learning based Approach for MOS Prediction

Cited by 8 publications (3 citation statements)
References 0 publications
“…• The system from ByteDance AI-LAB (T20) [140] ranked 4th in terms of both system- and utterance-level SRCC. It was based on LDNet, and they combined the main and OOD track datasets with a shared encoder and separate decoders.…”
Section: Team Approaches (mentioning)
confidence: 99%
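As a rough illustration of the shared-encoder / separate-decoders idea described in the statement above, the sketch below uses plain PyTorch; the layer types, dimensions, and track names are assumptions for illustration only, not the actual T20/LDNet configuration. Batches from each track pass through a common encoder and a track-specific regression head.

import torch
import torch.nn as nn

class SharedEncoderMOS(nn.Module):
    # Illustrative multi-task MOS predictor: one shared encoder and one
    # regression decoder per track (e.g. main vs. OOD). All sizes are
    # placeholders for the sketch, not the published model's settings.
    def __init__(self, feat_dim=80, hidden_dim=256, tracks=("main", "ood")):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.decoders = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                             nn.ReLU(),
                             nn.Linear(hidden_dim, 1))
            for t in tracks
        })

    def forward(self, feats, track):
        # feats: (batch, frames, feat_dim) frame-level acoustic features
        enc, _ = self.encoder(feats)
        pooled = enc.mean(dim=1)                        # utterance-level embedding
        return self.decoders[track](pooled).squeeze(-1)  # predicted MOS

# Batches from the two track datasets update the same encoder
# but only their own decoder.
model = SharedEncoderMOS()
mos_main = model(torch.randn(4, 300, 80), track="main")
mos_ood = model(torch.randn(4, 300, 80), track="ood")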
“…Their superiority over other models was highlighted in the VoiceMOS Challenge 2022 [14], a shared task using common datasets for MOS prediction, where winning teams extended the SSL-MOS baseline yet outperformed it only by margins at the third decimal place of the correlation metrics. Interesting proposed additions to the baseline include ensembling [15,16], multi-task learning [17], and the use of speech recognizers to recreate the phoneme sequence [15] or to obtain ASR-based evaluations [16]. As the training dataset included VC and TTS systems spanning more than a decade [18], it is unclear whether the trained models can distinguish between similar systems and utterances, which is a realistic evaluation scenario for TTS researchers.…”
Section: Related Work (mentioning)
confidence: 99%
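Since the quoted passage judges systems by utterance- and system-level correlation metrics, the following sketch shows one common way to compute both SRCC values with scipy. Function and variable names are my own and this is not the challenge's official scoring script.

import numpy as np
from scipy.stats import spearmanr

def utterance_and_system_srcc(true_mos, pred_mos, system_ids):
    # Utterance-level SRCC over all utterances, plus system-level SRCC
    # where each system's score is the mean MOS over its utterances.
    true_mos = np.asarray(true_mos, dtype=float)
    pred_mos = np.asarray(pred_mos, dtype=float)
    utt_srcc = spearmanr(true_mos, pred_mos)[0]

    ids = np.asarray(system_ids)
    systems = sorted(set(system_ids))
    sys_true = [true_mos[ids == s].mean() for s in systems]
    sys_pred = [pred_mos[ids == s].mean() for s in systems]
    sys_srcc = spearmanr(sys_true, sys_pred)[0]
    return utt_srcc, sys_srcc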
“…BVCC is a collection of MOS ratings from its own large-scale listening test on samples obtained from 6 years of the Blizzard Challenge (BC) and 3 years of the Voice Conversion Challenge (VCC). BVCC was used as the baseline training data in the challenge and has greatly enabled subsequent research, e.g., [11,12,13].…”
Section: Automatic Prediction of MOS (mentioning)
confidence: 99%
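For context on how a rating collection such as BVCC is typically turned into training targets, the small helper below averages listener scores into per-utterance and per-system MOS. The record layout (system_id, utterance_id, score) is an assumption for illustration, not BVCC's actual file format.

from collections import defaultdict

def aggregate_mos(ratings):
    # ratings: iterable of (system_id, utterance_id, score) tuples from a
    # listening test. Returns per-utterance and per-system mean opinion scores.
    per_utt, per_sys = defaultdict(list), defaultdict(list)
    for system_id, utterance_id, score in ratings:
        per_utt[(system_id, utterance_id)].append(score)
        per_sys[system_id].append(score)
    utt_mos = {k: sum(v) / len(v) for k, v in per_utt.items()}
    sys_mos = {k: sum(v) / len(v) for k, v in per_sys.items()}
    return utt_mos, sys_mos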