YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Casanova, Edresson; Weber, Julian; Shulby, Christopher; Cândido, Arnaldo; Golge, Eren; Ponti, Moacir Antonelli

doi:10.48550/arxiv.2112.02418

Cited by 6 publications

(13 citation statements)

References 24 publications


“…We used 3 languages/training datasets for the TTS model, as follows: English: VCTK [17] dataset, containing 44 hours of speech from 109 speakers, sampled at 48 kHz. We divided the VCTK dataset into training, development and test subsets following [6]. To further increase the number of speakers for training, we used the subsets train-clean-100 and train-clean-360 from LibriTTS [18].…”

Section: Audio Datasets (mentioning)

confidence: 99%

“…As the authors did not use a soundproof studio, the dataset contains some environmental noise. Following [6], we resampled the audio to 16 kHz and used FullSubNet [22] as a denoiser. For development, we randomly selected 500 samples, leaving the rest for training.…”

Section: Audio Datasets (mentioning)

confidence: 99%
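
The development/training split described in the statement above (500 randomly selected samples held out, the rest used for training) can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_dev_train`, the fixed seed, and the placeholder file names are hypothetical and do not come from the cited paper, which does not say how the random selection was implemented.

```python
import random


def split_dev_train(items, dev_size=500, seed=0):
    """Randomly hold out `dev_size` items for development; the rest is training.

    The seed is an assumption added here for reproducibility only.
    """
    if dev_size > len(items):
        raise ValueError("dev_size exceeds the number of items")
    rng = random.Random(seed)      # local RNG, leaves global state untouched
    shuffled = list(items)         # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[:dev_size], shuffled[dev_size:]


# Hypothetical file names standing in for the dataset's utterances.
utterances = [f"utt_{i:05d}.wav" for i in range(2000)]
dev, train = split_dev_train(utterances)
print(len(dev), len(train))
```

Shuffling a copy and slicing keeps the two subsets disjoint by construction, so no utterance can appear in both development and training.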

“…Notably, most TTS systems are tailored for a single speaker, but many applications could benefit from new-speaker synthesis, i.e., synthesis of speakers not seen during training, employing only a few seconds of the target speech. This approach is called zero-shot multi-speaker TTS (ZS-TTS) as in [4,5,6].…”

Section: Introduction (mentioning)

confidence: 99%

“…In our previous work [6], we presented YourTTS, a zero-shot multi-speaker TTS model that showed good results for Portuguese using only a single-speaker dataset in the target language. In this paper, we combine the power of the YourTTS model with that of Wav2vec 2.0 [19] trained in a self-supervised way on 100 thousand hours of speech in 23 different languages [20].…”

Section: Introduction (mentioning)

confidence: 99%


A single speaker is almost all you need for automatic speech recognition

Casanova, Shulby, Korolev et al. 2022

Preprint

Self Cite


We explore the use of speech synthesis and voice conversion applied to augment datasets for automatic speech recognition (ASR) systems, in scenarios with only one speaker available for the target language. Through extensive experiments, we show that our approach achieves results compared to the state-of-the-art (SOTA) and requires only one speaker in the target language during speech synthesis/voice conversion model training. Finally, we show that it is possible to obtain promising results in the training of an ASR model with our data augmentation method and only a single real speaker in different target languages.

“…Recently, several non-autoregressive flow-based architectures for multispeaker TTS have been proposed [20,21]. These models can perform zero-shot voice cloning and potentially generalize to long utterances.…”

Section: Introduction (mentioning)

confidence: 99%

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

Gorodetskii, Ozhiganov 2022

Preprint


With recent advancements in voice cloning, the performance of speech synthesis for a target speaker has been rendered similar to the human level. However, autoregressive voice cloning systems still suffer from text alignment failures, resulting in an inability to synthesize long sentences. In this work, we propose a variant of attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and generalize to very long utterances as well. The proposed system is based on three independently trained components: a speaker encoder, synthesizer and universal vocoder. Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention, in combination with a set of modifications proposed for the synthesizer based on Tacotron 2. Moreover, effective zero-shot speaker adaptation is achieved by conditioning both the synthesizer and vocoder on a speaker encoder that has been pretrained on a large corpus of diverse data. We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances, and conclude that the proposed model can produce intelligible synthetic speech for extremely long utterances, while preserving a high extent of naturalness and similarity for short texts.
