Interspeech 2018
DOI: 10.21437/interspeech.2018-1113

Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Abstract: Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, because they lack the ability to model global characteristics of speech (such as speaker individuality or speaking style), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive remains an open issue. In this paper, we propose combining VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). This approach, unlike tradi…
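
The core idea in the abstract, conditioning an autoregressive synthesizer on an unsupervised utterance-level latent learned by a VAE, can be sketched as follows. This is a minimal illustration under assumed module names, sizes, and a pooled text encoding (GlobalVAEEncoder, ARDecoder, text_enc, PyTorch); it is not the paper's VoiceLoop-based implementation.

```python
# Minimal sketch (not the authors' released code): a reference encoder compresses
# an utterance's acoustic features into a single global latent z with the VAE
# reparameterisation trick; z then conditions an autoregressive decoder that
# predicts each acoustic frame from the previous one. Module names, sizes, and
# the pooled text encoding `text_enc` are hypothetical; VoiceLoop's shifting
# buffer mechanism is not reproduced here.
import torch
import torch.nn as nn


class GlobalVAEEncoder(nn.Module):
    """Encode a whole utterance [B, T, n_mels] into mu/logvar of a latent z."""

    def __init__(self, n_mels=80, hidden=256, z_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, mels):
        _, h = self.rnn(mels)              # h: [1, B, hidden]
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterise
        return z, mu, logvar


class ARDecoder(nn.Module):
    """Autoregressive frame decoder conditioned on text encoding and global z."""

    def __init__(self, n_mels=80, text_dim=256, z_dim=16, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels + text_dim + z_dim, hidden)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, text_enc, z, target_mels=None, n_steps=None):
        """Teacher-forced when target_mels is given, free-running otherwise."""
        B = text_enc.size(0)
        T = target_mels.size(1) if target_mels is not None else n_steps
        prev = torch.zeros(B, self.out.out_features)
        h = torch.zeros(B, self.rnn.hidden_size)
        frames = []
        for t in range(T):
            h = self.rnn(torch.cat([prev, text_enc, z], dim=-1), h)
            frame = self.out(h)
            frames.append(frame)
            prev = target_mels[:, t] if target_mels is not None else frame
        return torch.stack(frames, dim=1)   # [B, T, n_mels]


def vae_tts_loss(pred, target, mu, logvar, kl_weight=1e-3):
    """Frame reconstruction loss plus KL divergence to the standard normal prior."""
    recon = nn.functional.l1_loss(pred, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```

At synthesis time, z can be sampled from the prior or taken from a reference utterance, which is what allows the global characteristics to be controlled without style labels.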

Cited by 113 publications (83 citation statements). References 17 publications (31 reference statements).

“…With regards to research, a number of themes do indeed echo topics within speech interface research. For instance, vocal quality clearly maps to work on speech synthesis, where developing more human-like, expressive [1], emotive [39] and personality-filled [51] voices is currently underway. User research has also focused on exploring the role of humanness in partner knowledge assumptions [15], vocal quality [16,37], partner identity [4], linguistic content [13] and conversational interactivity [13,45].…”
Section: Discussion
confidence: 99%
“…VAEs have been demonstrated for speech synthesis [18,19], voice conversion [20], and intonation modelling [21,Chapter 7]. Discrete representations have also been incorporated into the VAE framework [22,23].…”
Section: Related Work
confidence: 99%
“…text generation [9], image generation [10,11] and speech generation [12,13] tasks. VAE has many merits, such as learning disentangled factors and smoothly interpolating or continuously sampling between latent representations, which can yield interpretable homotopies [9].…”
Section: Introduction
confidence: 99%
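
The smooth interpolation between latent representations mentioned in the statement above can be illustrated with a short sketch. It reuses the hypothetical GlobalVAEEncoder and ARDecoder from the sketch following the abstract, so it is an assumption-laden illustration of the general VAE property rather than code from this paper or the citing work.

```python
# Sketch of the interpolation property: take the posterior means of two
# reference utterances and decode along the straight line between them,
# which traces a smooth path ("homotopy") in expression space.
import torch


def interpolate_expressions(encoder, decoder, text_enc, mels_a, mels_b,
                            n_frames=200, steps=5):
    """Decode free-running frames for latents between z_a and z_b."""
    with torch.no_grad():
        _, mu_a, _ = encoder(mels_a)   # posterior means as deterministic codes
        _, mu_b, _ = encoder(mels_b)
        samples = []
        for alpha in torch.linspace(0.0, 1.0, steps):
            z = (1 - alpha) * mu_a + alpha * mu_b   # convex combination
            samples.append(decoder(text_enc, z, n_steps=n_frames))
    return samples   # each entry: [B, n_frames, n_mels]
```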