2021
DOI: 10.48550/arxiv.2104.05557
Preprint

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Abstract: In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen in training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the …
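The zero-shot pipeline the abstract describes can be summarized as: a speaker encoder maps a short reference clip to a fixed embedding, a flow-based decoder generates a spectrogram conditioned on that embedding, and a GAN-based vocoder renders the waveform. The sketch below illustrates only this data flow; every function is a hypothetical stand-in, not the SC-GlowTTS API.

```python
# Minimal sketch of a zero-shot multi-speaker TTS pipeline.
# All function bodies are toy stand-ins (assumptions, not the paper's code):
# a real system uses learned neural networks at each stage.

def speaker_encoder(reference_audio):
    # Stand-in: a real speaker encoder returns a learned d-dimensional
    # embedding; here we just repeat the mean of the reference samples.
    mean = sum(reference_audio) / len(reference_audio)
    return [mean] * 4  # pretend d = 4

def flow_decoder(text, spk_emb):
    # Stand-in for the flow-based decoder: one fake "spectrogram frame"
    # per character, each frame conditioned on the speaker embedding.
    return [[ord(ch) % 7] + spk_emb for ch in text]

def vocoder(spectrogram):
    # Stand-in for the GAN-based vocoder: flatten frames into a "waveform".
    return [x for frame in spectrogram for x in frame]

# Zero-shot use: no fine-tuning data or extra parameters for the new speaker.
ref = [0.2, 0.4, 0.6]          # a few seconds of reference audio (toy values)
emb = speaker_encoder(ref)     # speaker embedding from the reference clip
mel = flow_decoder("hi", emb)  # spectrogram conditioned on that embedding
wav = vocoder(mel)             # final waveform
```

The point of the sketch is the dependency order: the new speaker's identity enters only through `emb`, so synthesizing an unseen voice requires nothing but a reference clip.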

Cited by 11 publications (13 citation statements) | References 22 publications
“…• Zero-shot adaptation. Some works [9,43,139,55,32] conduct zero-shot adaptation, which leverage a speaker encoder to extract speaker embeddings given a reference audio. This scenario is quite appealing since no adaptation data and parameters are needed.…”
Section: Efficient Adaptation
confidence: 99%
“…The proposed model was pre-trained for 500k iterations as in Section 3.2, then fine-tuned for 10k iterations on LJ-30min and 100k iterations on the other subsets. For comparison, we set the baselines with Meta-StyleSpeech [9], SC-GlowTTS 2 [10], and VITS-baseline. The VITS-baseline has the same architecture as the fine-tuned model but without pre-training, and these baselines were trained on the fine-tuning parts except LJ-30min.…”
Section: Zero-Shot Multi-Speaker TTS
confidence: 99%
“…The difficulty and cost of collecting the labeled dataset can limit the application in various fields. In the zero-shot multi-speaker TTS (ZS-TTS) [7][8][9][10], which synthesize voices of new speakers with only a few seconds of reference speech, it becomes more difficult to collect labeled dataset. The dataset should be composed of as many speakers as possible for better speaker generalization.…”
Section: Introduction
confidence: 99%
“…In the recent neural-based methods, conditioning on speaker embeddings has been a popular strategy. Specifically, the speaker representation is commonly extracted by a speaker embedding model and then is used as the conditional attribute in a TTS model [29], [30], [31], [32]. For instance, in [29], the speaker embedding vectors are obtained from a separately trained speaker verification model, and the TTS model Tacotron2 [12] conditioned on the speaker embeddings is used for multi-speaker speech synthesis.…”
Section: A Text-to-Speech Synthesis
confidence: 99%
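The conditioning strategy quoted above (broadcast a fixed speaker embedding across every time step of the text-encoder output) can be sketched in a few lines. This is a minimal illustration under assumed shapes; the function name and the concatenation-based conditioning are generic stand-ins, not the API of any cited model.

```python
# Hypothetical sketch: condition text-encoder hidden states on a speaker
# embedding by concatenating the same embedding to every time step.
# text_hidden: list of T frames, each a list of d_text floats.
# spk_emb:     a single list of d_spk floats (one per utterance).

def condition_on_speaker(text_hidden, spk_emb):
    # `frame + spk_emb` is list concatenation, so each output frame
    # has d_text + d_spk dimensions; a real model would then project
    # this back down with a learned linear layer.
    return [frame + spk_emb for frame in text_hidden]

h = [[0.1, 0.2], [0.3, 0.4]]   # T = 2 frames, d_text = 2
e = [0.9, 0.8, 0.7]            # d_spk = 3, from a speaker encoder
out = condition_on_speaker(h, e)
# every frame now carries the speaker identity: d_text + d_spk = 5 dims
```

Because the same embedding is attached to all frames, the speaker identity is a global attribute of the utterance while the per-frame content still comes from the text encoder.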