Interspeech 2017
DOI: 10.21437/interspeech.2017-171

Principles for Learning Controllable TTS from Annotated and Latent Variation

Abstract: For building flexible and appealing high-quality speech synthesisers, it is desirable to be able to accommodate and reproduce fine variations in vocal expression present in natural speech. Synthesisers can enable control over such output properties by adding adjustable control parameters in parallel to their text input. If not annotated in training data, the values of these control inputs can be optimised jointly with the model parameters. We describe how this established method can be seen as approximate maxi…
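The joint-optimisation idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical PyTorch toy model (the layer sizes, MSE loss, and feature shapes below are illustrative assumptions, not the paper's actual architecture): each training utterance gets a free control vector, and a single optimiser updates those vectors together with the network weights.

```python
# Minimal sketch: unannotated control inputs optimised jointly with the
# synthesiser weights. All shapes and the architecture are illustrative.
import torch
import torch.nn as nn

class ControllableTTS(nn.Module):
    def __init__(self, num_utterances, text_dim=128, control_dim=4,
                 hidden=256, out_dim=80):
        super().__init__()
        # One learnable control vector per training utterance (latent
        # variation); with annotated data these would instead be fixed
        # encodings of the labels.
        self.control = nn.Embedding(num_utterances, control_dim)
        self.net = nn.Sequential(
            nn.Linear(text_dim + control_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),  # e.g. mel-spectrogram frames
        )

    def forward(self, text_feats, utt_ids):
        c = self.control(utt_ids)                              # (B, control_dim)
        c = c.unsqueeze(1).expand(-1, text_feats.size(1), -1)  # repeat per frame
        return self.net(torch.cat([text_feats, c], dim=-1))

model = ControllableTTS(num_utterances=1000)
# A single optimiser covers the control vectors and the network weights,
# i.e. they are optimised jointly.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

text_feats = torch.randn(8, 50, 128)    # dummy linguistic features (B, T, D)
utt_ids = torch.randint(0, 1000, (8,))  # which training utterance each item is
target = torch.randn(8, 50, 80)         # dummy acoustic targets

opt.zero_grad()
loss = nn.functional.mse_loss(model(text_feats, utt_ids), target)
loss.backward()
opt.step()
```

At synthesis time the control vector becomes the user-facing knob: setting it by hand, rather than looking it up by utterance, steers the vocal expression of the output.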

Cited by 16 publications (22 citation statements)
References 22 publications
“…Henter et al. [39] and Zhu et al. [34] succeeded in creating nuances of emotion without using emotion-degree annotations; nevertheless, this work still relies on emotion labels as input.…”
Section: Related Work
confidence: 99%
“…These are jointly trained with the weights of the TTS model using backpropagation [14]. For example, [10] and [11] trained embedding vectors in a supervised manner using emotion labels. Recently, several studies have adopted an unsupervised method in which embedding vectors are trained in a deep learning framework, but without annotated labels [12], [13], [15].…”
Section: Introduction
confidence: 99%
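The supervised and unsupervised flavours contrasted in the excerpt above can be sketched side by side. This is a hypothetical PyTorch illustration (the GRU reference encoder and all dimensions are assumptions, loosely in the spirit of the unsupervised reference-encoder studies cited, not any specific paper's implementation):

```python
# Sketch: supervised emotion-label embedding vs. an unsupervised reference
# encoder; both produce style vectors trained jointly with the TTS model
# by backpropagation. All names and dimensions are illustrative.
import torch
import torch.nn as nn

NUM_EMOTIONS, STYLE_DIM, MEL_DIM = 5, 16, 80

# Supervised: one embedding per annotated emotion label.
label_embed = nn.Embedding(NUM_EMOTIONS, STYLE_DIM)

# Unsupervised: derive a style vector from reference audio; no labels needed.
class ReferenceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(MEL_DIM, STYLE_DIM, batch_first=True)

    def forward(self, ref_mel):      # ref_mel: (B, T, MEL_DIM)
        _, h = self.rnn(ref_mel)     # final hidden state summarises the style
        return h.squeeze(0)          # (B, STYLE_DIM)

ref_encoder = ReferenceEncoder()

labels = torch.randint(0, NUM_EMOTIONS, (8,))
ref_mel = torch.randn(8, 120, MEL_DIM)

style_supervised = label_embed(labels)     # requires annotation
style_unsupervised = ref_encoder(ref_mel)  # trained end-to-end, label-free
```

Either style vector would then be concatenated with the text encoding, so gradients from the synthesis loss train it alongside the rest of the model.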
“…This paper addresses the problem of synthesizing expressive speech without relying on speech expression labels, which we refer to as unsupervised expressive speech synthesis (UESS). Many studies have reported that such labels are helpful for modeling complex audio data [5,6,4,7]. Unsupervised methods, however, are more desirable because expressive speech is easy to obtain from video hosting websites (e.g., YouTube) or audiobooks, but annotating such sources is costly.…”
Section: Introduction
confidence: 99%