2021
DOI: 10.48550/arXiv.2108.02271
Preprint

Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Abstract: This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art on inter-speaker and inter-text prosody transfer. This improvement is achieved using FiLM conditioning layers, alongside adversarial training that encourages disentanglement between prosodic information and speaker identity. The acoustic model inherits attractive qualities from FastSpeech 2, such as fast inference and prediction of local prosody attributes for finer-grained control over generation. Experimental results s…
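The FiLM conditioning named in the abstract is feature-wise linear modulation: a conditioning vector is projected to a per-channel scale and shift that modulate hidden activations. Below is a minimal PyTorch sketch of the general technique; the module name, dimensions, and where such a layer would sit in the network are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector is projected
    to a per-channel scale (gamma) and shift (beta) applied to the
    hidden activations it modulates."""

    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        # A single linear projection produces both gamma and beta.
        self.proj = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim); cond: (batch, cond_dim), e.g. a
        # prosody embedding extracted from a reference utterance.
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)
```

Deriving the scale and shift from a prosody embedding lets one acoustic backbone render different prosodic styles without retraining, which is what makes FiLM attractive for cross-speaker transfer.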

Cited by 3 publications (11 citation statements)
References 11 publications (16 reference statements)
“…These models can generalize to new styles, but also allow for interpolation in the style latent space. Other work has even made style generalization for unseen speakers possible [ZSvNC21]. We take inspiration from these methods and adapt their ideas to gesture generation to inherit their advantages.…”
Section: Related Work
confidence: 99%
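As a concrete reading of "interpolation in the style latent space" from the excerpt above, a hedged sketch: blending two reference-derived style embeddings with a scalar weight. The function name and tensor shapes are hypothetical.

```python
import torch

def interpolate_styles(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly blend two style embeddings produced by a reference encoder.

    alpha = 0.0 returns style A unchanged; alpha = 1.0 returns style B;
    intermediate values yield a mixture that can be fed to the decoder
    in place of either original embedding.
    """
    return (1.0 - alpha) * z_a + alpha * z_b

# Example: a rendition biased 30% toward style B (dummy embeddings here).
z_mix = interpolate_styles(torch.randn(128), torch.randn(128), alpha=0.3)
```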
“…The architecture, shown in Fig. 3b, is inspired by [ZSvNC21], which was originally proposed for speech prosody encoding. First, the sequence of animation features, A, is passed through two 1D convolution layers, each followed by a ReLU and a layer normalization layer.…”
Section: Speech Encoder
confidence: 99%
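The two-convolution front end described in this excerpt maps directly onto a small module. A sketch under assumptions: the ConvPrenet name, kernel size, and hidden width are illustrative; only the Conv1d → ReLU → LayerNorm ordering comes from the quote.

```python
import torch
import torch.nn as nn

class ConvPrenet(nn.Module):
    """Two Conv1d layers, each followed by ReLU and LayerNorm,
    matching the encoder front end described in the excerpt."""

    def __init__(self, in_dim: int, hidden_dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2  # keep the time dimension unchanged
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # a: (batch, time, in_dim); Conv1d expects (batch, channels, time)
        x = self.conv1(a.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(torch.relu(x))
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(torch.relu(x))
        return x
```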
“…Positional encoding encourages the model to encode the sequence ordering [VSP*17]. Then, similarly to [ZSvNC21], we apply a Feed-Forward Transformer block that implements a multi-head self-attention layer [VSP*17] and two 1D convolution layers, each followed by a residual connection and layer normalization. This results in a sequence of shape M × 2De.…”
Section: Speech Encoder
confidence: 99%
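A hedged sketch of the Feed-Forward Transformer block this excerpt describes, in the FastSpeech style it points to: self-attention, then a two-convolution feed-forward stage, each sublayer wrapped in a residual connection and layer normalization. Head count and kernel size are assumptions; `dim` would correspond to the 2De of the quoted output shape.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: multi-head self-attention, then
    two Conv1d layers, with residual connections and LayerNorm."""

    def __init__(self, dim: int, n_heads: int = 2, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=pad)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, dim) -- dim plays the role of 2*De in the quote.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        y = self.conv2(torch.relu(self.conv1(x.transpose(1, 2)))).transpose(1, 2)
        return self.norm2(x + y)
```

The quote is ambiguous about whether each convolution gets its own residual connection; this sketch follows the usual FastSpeech convention of treating the convolution pair as one residual sublayer.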
“…Explicit prosody modeling addresses both the low prosodic variance and the lack of prosody control in end-to-end TTS. One technique is prosody transfer (PT), which was first described in [8] and has subsequently been widely studied [8][9][10][11][12][13][14][15]. PT models are trained to transfer prosody from a reference to a target utterance.…”
Section: Introduction
confidence: 99%
“…PT models have been shown to capture prosodic information and to have the capacity to produce prosodic variance beyond that found in the training data [8,14,15]. They are therefore capable of synthesizing emotive and expressive speech and, to some extent, selecting a target prosodic rendition, simply by using an appropriate reference.…”
Section: Introduction
confidence: 99%