Interspeech 2017
DOI: 10.21437/interspeech.2017-171

Principles for Learning Controllable TTS from Annotated and Latent Variation

Abstract: For building flexible and appealing high-quality speech synthesisers, it is desirable to be able to accommodate and reproduce fine variations in vocal expression present in natural speech. Synthesisers can enable control over such output properties by adding adjustable control parameters in parallel to their text input. If not annotated in training data, the values of these control inputs can be optimised jointly with the model parameters. We describe how this established method can be seen as approximate maxi…
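The joint-optimisation idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical PyTorch toy model (the layer sizes, MSE loss, and feature shapes below are illustrative assumptions, not the paper's actual architecture): each training utterance gets a free control vector, and a single optimiser updates those vectors together with the network weights.

```python
# Minimal sketch: unannotated control inputs optimised jointly with the
# synthesiser weights. All shapes and the architecture are illustrative.
import torch
import torch.nn as nn

class ControllableTTS(nn.Module):
    def __init__(self, num_utterances, text_dim=128, control_dim=4,
                 hidden=256, out_dim=80):
        super().__init__()
        # One learnable control vector per training utterance (latent
        # variation); with annotated data these would instead be fixed
        # encodings of the labels.
        self.control = nn.Embedding(num_utterances, control_dim)
        self.net = nn.Sequential(
            nn.Linear(text_dim + control_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),  # e.g. mel-spectrogram frames
        )

    def forward(self, text_feats, utt_ids):
        c = self.control(utt_ids)                              # (B, control_dim)
        c = c.unsqueeze(1).expand(-1, text_feats.size(1), -1)  # repeat per frame
        return self.net(torch.cat([text_feats, c], dim=-1))

model = ControllableTTS(num_utterances=1000)
# A single optimiser covers the control vectors and the network weights,
# i.e. they are optimised jointly.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

text_feats = torch.randn(8, 50, 128)    # dummy linguistic features (B, T, D)
utt_ids = torch.randint(0, 1000, (8,))  # which training utterance each item is
target = torch.randn(8, 50, 80)         # dummy acoustic targets

opt.zero_grad()
loss = nn.functional.mse_loss(model(text_feats, utt_ids), target)
loss.backward()
opt.step()
```

At synthesis time the control vector becomes the user-facing knob: setting it by hand, rather than looking it up by utterance, steers the vocal expression of the output.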

Cited by 16 publications (22 citation statements)
References 22 publications
“…Henter et al. [39] and Zhu et al. [34] succeeded in creating nuances of emotion without using emotion-degree annotations; nevertheless, this work still relies on emotion labels as input.…”
Section: Related Work
confidence: 99%
“…These are jointly trained with the weights of the TTS model using backpropagation [14]. For example, [10] and [11] trained embedding vectors in a supervised manner using emotion labels. Recently, several studies have adopted an unsupervised method in which embedding vectors are trained in a deep learning framework, but without annotated labels [12], [13], [15].…”
Section: Introduction
confidence: 99%
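The supervised and unsupervised flavours contrasted in the excerpt above can be sketched side by side. This is a hypothetical PyTorch illustration (the GRU reference encoder and all dimensions are assumptions, loosely in the spirit of the unsupervised reference-encoder studies cited, not any specific paper's implementation):

```python
# Sketch: supervised emotion-label embedding vs. an unsupervised reference
# encoder; both produce style vectors trained jointly with the TTS model
# by backpropagation. All names and dimensions are illustrative.
import torch
import torch.nn as nn

NUM_EMOTIONS, STYLE_DIM, MEL_DIM = 5, 16, 80

# Supervised: one embedding per annotated emotion label.
label_embed = nn.Embedding(NUM_EMOTIONS, STYLE_DIM)

# Unsupervised: derive a style vector from reference audio; no labels needed.
class ReferenceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(MEL_DIM, STYLE_DIM, batch_first=True)

    def forward(self, ref_mel):      # ref_mel: (B, T, MEL_DIM)
        _, h = self.rnn(ref_mel)     # final hidden state summarises the style
        return h.squeeze(0)          # (B, STYLE_DIM)

ref_encoder = ReferenceEncoder()

labels = torch.randint(0, NUM_EMOTIONS, (8,))
ref_mel = torch.randn(8, 120, MEL_DIM)

style_supervised = label_embed(labels)     # requires annotation
style_unsupervised = ref_encoder(ref_mel)  # trained end-to-end, label-free
```

Either style vector would then be concatenated with the text encoding, so gradients from the synthesis loss train it alongside the rest of the model.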
“…This paper addresses the problem of synthesizing expressive speech without relying on speech expression labels, which we refer to as unsupervised expressive speech synthesis (UESS). Many studies have reported that such labels are helpful for modeling complex audio data [5,6,4,7]. Unsupervised methods, however, are more desirable because expressive speech is easy to obtain from video hosting websites (e.g., YouTube) or audiobooks, but annotating such sources is costly.…”
Section: Introduction
confidence: 99%