2021
DOI: 10.48550/arxiv.2104.05557
Preprint

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Abstract: In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen in training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the …
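The zero-shot pipeline the abstract describes can be summarized as: a speaker encoder maps a short reference clip to a fixed embedding, a flow-based decoder generates a spectrogram conditioned on that embedding, and a GAN-based vocoder renders the waveform. The sketch below illustrates only this data flow; every function is a hypothetical stand-in, not the SC-GlowTTS API.

```python
# Minimal sketch of a zero-shot multi-speaker TTS pipeline.
# All function bodies are toy stand-ins (assumptions, not the paper's code):
# a real system uses learned neural networks at each stage.

def speaker_encoder(reference_audio):
    # Stand-in: a real speaker encoder returns a learned d-dimensional
    # embedding; here we just repeat the mean of the reference samples.
    mean = sum(reference_audio) / len(reference_audio)
    return [mean] * 4  # pretend d = 4

def flow_decoder(text, spk_emb):
    # Stand-in for the flow-based decoder: one fake "spectrogram frame"
    # per character, each frame conditioned on the speaker embedding.
    return [[ord(ch) % 7] + spk_emb for ch in text]

def vocoder(spectrogram):
    # Stand-in for the GAN-based vocoder: flatten frames into a "waveform".
    return [x for frame in spectrogram for x in frame]

# Zero-shot use: no fine-tuning data or extra parameters for the new speaker.
ref = [0.2, 0.4, 0.6]          # a few seconds of reference audio (toy values)
emb = speaker_encoder(ref)     # speaker embedding from the reference clip
mel = flow_decoder("hi", emb)  # spectrogram conditioned on that embedding
wav = vocoder(mel)             # final waveform
```

The point of the sketch is the dependency order: the new speaker's identity enters only through `emb`, so synthesizing an unseen voice requires nothing but a reference clip.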

Cited by 11 publications (13 citation statements) | References 22 publications
“…• Zero-shot adaptation. Some works [9,43,139,55,32] conduct zero-shot adaptation, which leverage a speaker encoder to extract speaker embeddings given a reference audio. This scenario is quite appealing since no adaptation data and parameters are needed.…”
Section: Efficient Adaptation
confidence: 99%
“…The proposed model was pre-trained for 500k iterations as in Section 3.2, then fine-tuned for 10k iterations on LJ-30min and 100k iterations on the other subsets. For comparison, we set the baselines with Meta-StyleSpeech [9], SC-GlowTTS 2 [10], and VITS-baseline. The VITS-baseline has the same architecture as the fine-tuned model but without pre-training, and these baselines were trained on the fine-tuning parts except LJ-30min.…”
Section: Zero-Shot Multi-Speaker TTS
confidence: 99%
“…The difficulty and cost of collecting the labeled dataset can limit the application in various fields. In the zero-shot multi-speaker TTS (ZS-TTS) [7][8][9][10], which synthesize voices of new speakers with only a few seconds of reference speech, it becomes more difficult to collect labeled dataset. The dataset should be composed of as many speakers as possible for better speaker generalization.…”
Section: Introduction
confidence: 99%
“…In the recent neural-based methods, conditioning on speaker embeddings has been a popular strategy. Specifically, the speaker representation is commonly extracted by a speaker embedding model and then is used as the conditional attribute in a TTS model [29], [30], [31], [32]. For instance, in [29], the speaker embedding vectors are obtained from a separately trained speaker verification model, and the TTS model Tacotron2 [12] conditioned on the speaker embeddings is used for multi-speaker speech synthesis.…”
Section: A Text-to-Speech Synthesis
confidence: 99%
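The conditioning strategy quoted above (broadcast a fixed speaker embedding across every time step of the text-encoder output) can be sketched in a few lines. This is a minimal illustration under assumed shapes; the function name and the concatenation-based conditioning are generic stand-ins, not the API of any cited model.

```python
# Hypothetical sketch: condition text-encoder hidden states on a speaker
# embedding by concatenating the same embedding to every time step.
# text_hidden: list of T frames, each a list of d_text floats.
# spk_emb:     a single list of d_spk floats (one per utterance).

def condition_on_speaker(text_hidden, spk_emb):
    # `frame + spk_emb` is list concatenation, so each output frame
    # has d_text + d_spk dimensions; a real model would then project
    # this back down with a learned linear layer.
    return [frame + spk_emb for frame in text_hidden]

h = [[0.1, 0.2], [0.3, 0.4]]   # T = 2 frames, d_text = 2
e = [0.9, 0.8, 0.7]            # d_spk = 3, from a speaker encoder
out = condition_on_speaker(h, e)
# every frame now carries the speaker identity: d_text + d_spk = 5 dims
```

Because the same embedding is attached to all frames, the speaker identity is a global attribute of the utterance while the per-frame content still comes from the text encoder.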