Hierarchical Generative Modeling for Controllable Speech Synthesis

Hsu, Wei-Ning; Weiss, Ron; Zen, Heiga; Wu, Yangjie; Wang, Yuxuan; Chen, Yuan; Jia, Yali; Chen, Zhifeng; Shen, Jonathan; Nguyen, Patrick; Pang, Ruoming

doi:10.48550/arxiv.1810.07217

Cited by 31 publications

(66 citation statements)

References 14 publications


“…Concerning controllable speech synthesis, [1] proposed to use a VAE and deploy a speech synthesis system that combines VAE with VoiceLoop [15]. Some other researches have used the concept of VAE [4,5] for controllable speech synthesis. In [5], the authors combine VAE and GMM and call it GMVAE.…”

Section: Related Work and Challenges (mentioning)

confidence: 99%


Analysis and Assessment of Controllability of an Expressive Deep Learning-based TTS system

Tits, Haddad, Dutoit. 2021. Preprint.


In this paper, we study the controllability of an Expressive TTS system trained on a dataset for continuous control. The dataset is the Blizzard 2013 dataset, based on audiobooks read by a female speaker and containing great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that of a reference utterance. CCS CONCEPTS: Human-centered computing → Walkthrough evaluations.



“…Some other researches have used the concept of VAE [4,5] for controllable speech synthesis. In [5], the authors combine VAE and GMM and call it GMVAE. For more details concerning the different variants of such methods, an in-depth study of methods for unsupervised learning of control in speech synthesis is given in [4].…”

Section: Related Work and Challenges (mentioning)

confidence: 99%

Analysis and Assessment of Controllability of an Expressive Deep Learning-based TTS system

Tits, Haddad, Dutoit. 2021. Preprint.


“…Along with these latent representation based methods, another set of studies focus on modelling prosody in a hierarchical manner along with a reference encoder [12,13,14]. Here the input text is represented at various levels that are spanning from coarser (e.g., sentences) to finer (e.g., phonemes) levels.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Abbas, Bollepalli, Moinet, et al. 2021. Preprint.


We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody. We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms while Sentence-level MSS models sentence-level spectrogram in addition. Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices.


“…For example, in the field of autonomous driving, it is often inconvenient to obtain reference audio. As a result, multitask anthropomorphic speech synthesis methods without reference audio are designed, for instance, variational auto-encoder (VAE) [21]. However, the training of the model requires the use of audiobooks corpus.…”

Section: Introduction (mentioning)

confidence: 99%

MASS: Multi-task Anthropomorphic Speech Synthesis Framework

Chen, Ming. 2021. Preprint.


Text-to-Speech (TTS) synthesis plays an important role in human-computer interaction. Currently, most TTS technologies focus on the naturalness of speech, namely, making the speeches sound like humans. However, the key tasks of the expression of emotion and the speaker identity are ignored, which limits the application scenarios of TTS synthesis technology. To make the synthesized speech more realistic and
