A Survey on Neural Speech Synthesis

Tan, Xu; Qin, Tao; Soong, Frank; Liu, Tie-Yan

doi:10.48550/arxiv.2106.15561

Cited by 60 publications

(85 citation statements)

References 286 publications

(782 reference statements)

“…This was done because although the latest works in ZS-TTS [3,4,10] only use the VCTK dataset, this dataset has a limited number of speakers (109) and little variety of recording conditions. Thus, after training with only this dataset, in general, ZS-TTS models do not generalize satisfactorily to new speakers where recording conditions or voice characteristics are very different than those seen in the training [12].…”

Section: Methods (mentioning)

confidence: 93%

“…The different recording conditions are a challenge for the generalization of the zero-shot multi-speaker TTS models. In addition, speakers who have a voice that differs greatly from those seen in training also become a challenge [12]. Nevertheless, to show the potential of our model for adaptation to new speakers/recording conditions, we selected from 20 to 61 seconds of speech for 2 speakers (1M/1F) from Portuguese and the same for English in the Common Voice [37] dataset.…”

Section: Speaker Adaptation (mentioning)

confidence: 99%

“…Also, we would like to thank the Defined.ai for making industrial-level MOS testing so easily available. Finally, we would like to thank all contributors to the Coqui TTS repository, this work was only possible thanks to the commitment of all.…”

Section: Acknowledgements (mentioning)

confidence: 99%

“…ZS-TTS models still require a large number of speakers for training, making it difficult to obtain high-quality models in low-resource languages. Furthermore, according to [12], the quality of current ZS-TTS models is not sufficiently good, especially for target speakers with speech characteristics that differ from those seen in training. Although SC-GlowTTS [10] achieved promising results with only 11 speakers from the VCTK dataset [13], generally, limiting the number and variety of training speakers further hinders the generalization of the model for unseen voices.…”

Section: Introduction (mentioning)

confidence: 99%

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Casanova, Weber, Shulby et al. 2021

Preprint

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.

“…Considering the aforementioned benefits, TTS is undoubtedly an essential speech processing technology for any language. In recent years, TTS research has progressed remarkably thanks to neural network-based architectures (Tan et al., 2021), regularly organized challenges (Black and Tokuda, 2005; Dunbar et al., 2019), and open-source datasets (Ito and Johnson, 2017; Zen et al., 2019; Shi et al., 2020). Especially, impressive results have been achieved for commercially viable languages, such as English and Mandarin.…”

Section: Introduction (mentioning)

confidence: 99%

KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics

Mussakhojayeva, Khassanov, Varol 2022

Preprint

We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size is increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage is diversified with the help of new sources, including a book and Wikipedia articles. This corpus is necessary for building high-quality TTS systems for Kazakh, a Central Asian agglutinative language from the Turkic family, which presents several linguistic challenges. We describe the corpus construction process and provide the details of the training and evaluation procedures for the TTS system. Our experimental results indicate that the constructed corpus is sufficient to build robust TTS models for real-world applications, with a subjective mean opinion score of above 4.0 for all the five speakers. We believe that our corpus will facilitate speech and language research for Kazakh and other Turkic languages, which are widely considered to be low-resource due to the limited availability of free linguistic data. The constructed corpus, code, and pretrained models are publicly available in our GitHub repository.
