2018
DOI: 10.48550/arxiv.1810.07217
Preprint

Hierarchical Generative Modeling for Controllable Speech Synthesis

Abstract: This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noi…

Cited by 31 publications (66 citation statements)
References 14 publications
“…Concerning controllable speech synthesis, [1] proposed to use a VAE and deploy a speech synthesis system that combines VAE with VoiceLoop [15]. Some other researches have used the concept of VAE [4,5] for controllable speech synthesis. In [5], the authors combine VAE and GMM and call it GMVAE.…”
Section: Related Work and Challenges (mentioning)
confidence: 99%
“…Some other researches have used the concept of VAE [4,5] for controllable speech synthesis. In [5], the authors combine VAE and GMM and call it GMVAE. For more details concerning the different variants of such methods, an in-depth study of methods for unsupervised learning of control in speech synthesis is given in [4].…”
Section: Related Work and Challenges (mentioning)
confidence: 99%
“…Along with these latent representation based methods, another set of studies focus on modelling prosody in a hierarchical manner along with a reference encoder [12,13,14]. Here the input text is represented at various levels that are spanning from coarser (e.g., sentences) to finer (e.g., phonemes) levels.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, in the field of autonomous driving, it is often inconvenient to obtain reference audio. As a result, multitask anthropomorphic speech synthesis methods without reference audio are designed, for instance, variational auto-encoder (VAE) [21]. However, the training of the model requires the use of audiobooks corpus.…”
Section: Introductionmentioning
confidence: 99%