2021
DOI: 10.48550/arxiv.2112.02418
Preprint

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Abstract: YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for ze…

Cited by 6 publications (13 citation statements)
References 24 publications
“…We used 3 languages/training datasets for the TTS model, as follows: English: VCTK [17] dataset, containing 44 hours of speech from 109 speakers, sampled at 48 kHz. We divided the VCTK dataset into training, development and test subsets following [6]. To further increase the number of speakers for training, we used the subsets train-clean-100 and train-clean-360 from LibriTTS [18].…”
Section: Audio Datasets (mentioning)
confidence: 99%
“…As the authors did not use a soundproof studio, the dataset contains some environmental noise. Following [6], we resampled the audios to 16 kHz and used FullSubNet [22] as a denoiser. For development, we randomly selected 500 samples, leaving the rest for training.…”
Section: Audio Datasets (mentioning)
confidence: 99%
“…Recently, several non-autoregressive flow-based architectures for multispeaker TTS have been proposed [20,21]. These models can perform zero-shot voice cloning and potentially generalize to long utterances.…”
Section: Introduction (mentioning)
confidence: 99%