2018
DOI: 10.48550/arxiv.1807.11470
Preprint

Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

Abstract: Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example, we show that popular unsupervised training heuristics can be interpreted as variational inference in certain autoenc…
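
The variational-inference view mentioned in the abstract centers on the evidence lower bound (ELBO). As a minimal sketch of standard VAE material (not the paper's own derivation, whose exact correspondence may differ), a weighted ELBO of the kind that β-VAE-style heuristics optimise is

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big),$$

where x is an acoustic-feature sequence, z is the unsupervised control variable, and β = 1 recovers the exact bound; annealing or re-weighting β is one popular training heuristic that can be read as variational inference in a modified model.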

Cited by 33 publications (41 citation statements)
References 70 publications
“…Other implementations have included cepstral coefficients [4] [36] or a variety of linguistic and acoustic features [37] [13]. Finally, widely recommended parameters have also been extracted by the WORLD vocoder [6] [38] [39].…”
Section: Acoustic Features
Mentioning confidence: 99%
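
The WORLD parameters named in this excerpt are F0, a smoothed spectral envelope, and aperiodicity. Below is a minimal sketch of extracting them with the pyworld Python bindings; this is an illustrative assumption (the cited works may use the original C implementation and different analysis settings), and utterance.wav is a hypothetical input file.

```python
# Hedged sketch: extracting the standard WORLD vocoder parameters
# (F0, spectral envelope, aperiodicity) with the pyworld package.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("utterance.wav")   # hypothetical mono input file
x = x.astype(np.float64)           # pyworld expects float64 samples

f0, t = pw.harvest(x, fs)          # F0 contour and frame times
sp = pw.cheaptrick(x, f0, t, fs)   # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)          # band aperiodicity

y = pw.synthesize(f0, sp, ap, fs)  # resynthesis as a sanity check
```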
“…Following these techniques, the generated sound can be conditioned on specific traits such as a speaker's voice [47] [27], independent pitch [3] [48] [36], linguistic features [49] [17] or latent representations [4] [45]. Instead of one-hot embeddings, some implementations have also used a confusion matrix to capture a variation of emotions [39], while others provided supplementary positional information for each segment, conditioning music on the artist or genre [43]. After training, the user can choose among the conditioning properties of the synthesised sound.…”
Section: Conditioning Representations
Mentioning confidence: 99%
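
As a hedged illustration of the conditioning idea in this excerpt, the sketch below feeds a one-hot speaker identity (via a learned embedding) and a latent style vector into a small decoder. All names and layer sizes are illustrative assumptions, not any cited system.

```python
# Hedged sketch: conditioning a synthesis decoder on a one-hot
# speaker identity plus a latent style vector.
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    def __init__(self, n_speakers=4, latent_dim=16, feat_dim=80):
        super().__init__()
        # learned lookup over one-hot speaker IDs
        self.speaker_emb = nn.Embedding(n_speakers, 32)
        self.net = nn.Sequential(
            nn.Linear(32 + latent_dim, 256),
            nn.Tanh(),
            nn.Linear(256, feat_dim),  # e.g. one mel-spectrogram frame
        )

    def forward(self, speaker_id, z):
        cond = torch.cat([self.speaker_emb(speaker_id), z], dim=-1)
        return self.net(cond)

dec = ConditionedDecoder()
frame = dec(torch.tensor([2]), torch.randn(1, 16))  # speaker 2, sampled style
```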
“…Concerning controllable speech synthesis, [1] proposed using a VAE and deployed a speech synthesis system that combines a VAE with VoiceLoop [15]. Other studies have used VAEs [4,5] for controllable speech synthesis. In [5], the authors combine a VAE with a GMM and call it GMVAE.…”
Section: Related Work and Challenges
Mentioning confidence: 99%
“…In [5], the authors combine a VAE with a GMM and call it GMVAE. For more detail on the different variants of such methods, an in-depth study of methods for unsupervised learning of control in speech synthesis is given in [4]. These works show that it is possible to build a latent space whose variables can be used to control the style of synthesized speech.…”
Section: Related Work and Challenges
Mentioning confidence: 99%
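
To make the GMVAE idea concrete, here is a hedged sketch of its defining ingredient: a Gaussian-mixture prior over the latent variable, so that each mixture component can come to represent a distinct speaking style. Dimensions and names are illustrative assumptions, not the implementation in [5].

```python
# Hedged sketch of a Gaussian-mixture prior over the VAE latent z,
# the core ingredient of a GMVAE.
import torch

K, latent_dim = 4, 16                   # mixture components, latent size
mu = torch.randn(K, latent_dim)         # learnable component means
log_sigma = torch.zeros(K, latent_dim)  # learnable component log-stddevs

def sample_prior(batch=1):
    k = torch.randint(K, (batch,))      # pick a component (a "style")
    eps = torch.randn(batch, latent_dim)
    return mu[k] + log_sigma[k].exp() * eps  # reparameterized draw

z = sample_prior()  # feed z to the decoder to synthesize in that style
```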
“…There are several emotional speech synthesis methods, for example methods based on reference-audio feature embeddings [15,18] or on variational auto-encoders (VAE) and normalizing flows [1]. Furthermore, Lee et al [37] proposed an emotional end-to-end neural speech synthesizer based on Tacotron.…”
Section: Introduction
Mentioning confidence: 99%