Interspeech 2021
DOI: 10.21437/interspeech.2021-1774

SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Abstract: Most zero-shot multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored multilingual ZS-TTS, they cover only a few high- and medium-resource languages, which limits their use in most low- and medium-resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to …

Cited by 62 publications (57 citation statements)
References: 10 publications
“…Zero-shot multi-speaker TTS (ZS-TTS) employs only a few seconds of speech dataset to create synthesized voices for target speakers. SC-GlowTTS is a novel model (GlowTTS and HiFi-GAN) to enhance the likeness to the target speaker using ZS-TTS [8]. The all pass warp (APW) has been integrated into neural network frameworks and investigated in ZS-TTS [35].…”
Section: Related Work (mentioning)
confidence: 99%
“…The term "Audio DeepFake" covers solutions that can create artificially modified speech. These solutions can either generate new utterances using Text-To-Speech (TTS) [1]- [3] and Voice Cloning [4], [5] methods, or modify existing utterances and therefore change it to someone else -Voice Conversion [6], [7]. The more recent architectures not only provide a synthesis of a speech but also focus on proper intonation, stress, and rhythm.…”
Section: Introduction (mentioning)
confidence: 99%
“…They are composed of two complementary systems: automatic speech recognition (ASR), and text-to-speech (TTS). In case of a TTS model, it receives text (or phonemes) as input and produces synthesized speech as output with desired features like emotion, intonation, rhythm, etc [3].…”
Section: Introduction (mentioning)
confidence: 99%