2021
DOI: 10.48550/arXiv.2108.02271
Preprint

Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Abstract: This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art on inter-speaker and inter-text prosody transfer. This improvement is achieved using FiLM conditioning layers, alongside adversarial training that encourages disentanglement between prosodic information and speaker identity. The acoustic model inherits attractive qualities from FastSpeech 2, such as fast inference and prediction of local prosody attributes for finer-grained control over generation. Experimental results s…
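The FiLM conditioning named in the abstract is feature-wise linear modulation: a conditioning vector is projected to a per-channel scale and shift that modulate hidden activations. Below is a minimal PyTorch sketch of the general technique; the module name, dimensions, and where such a layer would sit in the network are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector is projected
    to a per-channel scale (gamma) and shift (beta) applied to the
    hidden activations it modulates."""

    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        # A single linear projection produces both gamma and beta.
        self.proj = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim); cond: (batch, cond_dim), e.g. a
        # prosody embedding extracted from a reference utterance.
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)
```

Deriving the scale and shift from a prosody embedding lets one acoustic backbone render different prosodic styles without retraining, which is what makes FiLM attractive for cross-speaker transfer.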

Cited by 3 publications (11 citation statements)
References 11 publications (16 reference statements)
“…These models can generalize to new styles, but also allow for interpolation in the style latent space. Other work has even made style generalization for unseen speakers possible [ZSvNC21]. We take inspiration from these methods and adapt their ideas to gesture generation to inherit their advantages.…”
Section: Related Work
confidence: 99%
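As a concrete reading of "interpolation in the style latent space" from the excerpt above, a hedged sketch: blending two reference-derived style embeddings with a scalar weight. The function name and tensor shapes are hypothetical.

```python
import torch

def interpolate_styles(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly blend two style embeddings produced by a reference encoder.

    alpha = 0.0 returns style A unchanged; alpha = 1.0 returns style B;
    intermediate values yield a mixture that can be fed to the decoder
    in place of either original embedding.
    """
    return (1.0 - alpha) * z_a + alpha * z_b

# Example: a rendition biased 30% toward style B (dummy embeddings here).
z_mix = interpolate_styles(torch.randn(128), torch.randn(128), alpha=0.3)
```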
“…The architecture, shown in Fig. 3b, is inspired by [ZSvNC21], which was originally proposed for speech prosody encoding. First, the sequence of animation features, A, is passed through two 1D convolution layers, each followed by a ReLU and a layer normalization layer.…”
Section: Speech Encoder
confidence: 99%
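The two-convolution front end described in this excerpt maps directly onto a small module. A sketch under assumptions: the ConvPrenet name, kernel size, and hidden width are illustrative; only the Conv1d → ReLU → LayerNorm ordering comes from the quote.

```python
import torch
import torch.nn as nn

class ConvPrenet(nn.Module):
    """Two Conv1d layers, each followed by ReLU and LayerNorm,
    matching the encoder front end described in the excerpt."""

    def __init__(self, in_dim: int, hidden_dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2  # keep the time dimension unchanged
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # a: (batch, time, in_dim); Conv1d expects (batch, channels, time)
        x = self.conv1(a.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(torch.relu(x))
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(torch.relu(x))
        return x
```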
“…Positional encoding encourages the model to encode the sequence ordering [VSP*17]. Then, similarly to [ZSvNC21], we apply a Feed-Forward Transformer block that implements a multi-head self-attention layer [VSP*17] and two 1D convolution layers, each followed by a residual connection and layer normalization. This results in a sequence of shape M × 2De.…”
Section: Speech Encoder
confidence: 99%
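A hedged sketch of the Feed-Forward Transformer block this excerpt describes, in the FastSpeech style it points to: self-attention, then a two-convolution feed-forward stage, each sublayer wrapped in a residual connection and layer normalization. Head count and kernel size are assumptions; `dim` would correspond to the 2De of the quoted output shape.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: multi-head self-attention, then
    two Conv1d layers, with residual connections and LayerNorm."""

    def __init__(self, dim: int, n_heads: int = 2, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, dim) -- dim plays the role of 2*De in the quote.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        y = self.conv2(torch.relu(self.conv1(x.transpose(1, 2)))).transpose(1, 2)
        return self.norm2(x + y)
```

The quote is ambiguous about whether each convolution gets its own residual connection; this sketch follows the usual FastSpeech convention of treating the convolution pair as one residual sublayer.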
“…Explicit prosody modeling addresses both the low prosodic variance and the lack of prosody control in end-to-end TTS. One technique is prosody transfer (PT), which was first described in [8] and has subsequently been widely studied [8][9][10][11][12][13][14][15]. PT models are trained to transfer prosody from a reference to a target utterance.…”
Section: Introduction
confidence: 99%
“…PT models have been shown to capture prosodic information and to have the capacity to produce prosodic variance beyond that found in the training data [8,14,15]. They are therefore capable of synthesizing emotive and expressive speech and, to some extent, selecting a target prosodic rendition, simply by using an appropriate reference.…”
Section: Introduction
confidence: 99%