TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese

Casanova, Edresson; Cândido, Arnaldo; Shulby, Christopher; Oliveira, Frederico Santos de; Teixeira, João Paulo; Ponti, Moacir Antonelli; Aluísio, Sandra Maria

doi:10.1007/s10579-021-09570-4

Cited by 12 publications

(13 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Baseline: We choose the SOTA zero-shot TTS model YourTTS [Casanova et al, 2022b] as the baseline, which is trained on a combined dataset of VCTK [Veaux et al, 2016], LibriTTS , and TTS-Portuguese [Casanova et al, 2022a]. We use their released checkpoint * .…”

Section: Experiments Setupmentioning

confidence: 99%

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Wang¹,

Chen²,

Wu³

et al. 2023

Preprint

View full text Add to dashboard Cite

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

show abstract

Section: Experiments Setupmentioning

confidence: 99%

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Wang¹,

Chen²,

Wu³

et al. 2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Regarding signal quality, the Signal-to-Noise Ratio (SNR) holds significant importance, both during content filtering [36,37] and data recording stages [38][39][40][41]. Linguistic considerations also come into play, with some researchers emphasising the need for balanced phonemic or supraphonemic units within the dataset [38,39,41,42].…”

Section: Related Workmentioning

confidence: 99%

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

González-Docasal¹,

Álvarez²

2023

Preprint

View full text Add to dashboard Cite

Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also use two high-quality corpora for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audios for the training of a Voice Cloning system. Following these measurements, we conduct a series of ablations by removing audios with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of synthesised audios for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.

show abstract

“…Portuguese: TTS-Portuguese Corpus [21], a singlespeaker male dataset in Brazilian Portuguese (pt-BR) containing ca. 10 hrs, sampled at 48KHz.…”

Section: Audio Datasetsmentioning

confidence: 99%

A single speaker is almost all you need for automatic speech recognition

Casanova¹,

Shulby²,

Korolev³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

We explore the use of speech synthesis and voice conversion applied to augment datasets for automatic speech recognition (ASR) systems, in scenarios with only one speaker available for the target language. Through extensive experiments, we show that our approach achieves results compared to the state-of-theart (SOTA) and requires only one speaker in the target language during speech synthesis/voice conversion model training. Finally, we show that it is possible to obtain promising results in the training of an ASR model with our data augmentation method and only a single real speaker in different target languages.

show abstract

TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese

Cited by 12 publications

References 27 publications

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

A single speaker is almost all you need for automatic speech recognition

Contact Info

Product

Resources

About