Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Skerry-Ryan, RJ; Battenberg, Eric; Xiao, Ying; Wang, Yuxuan; Stanton, Daisy; Shor, Joel; Weiss, Ron; Clark, Rob; Saurous, Rif A.

doi:10.48550/arxiv.1803.09047

Cited by 41 publications

(66 citation statements)

References 0 publications

“…Since the speaker encoding network operates on waveforms, it can be used for zero-shot voice cloning from untranscribed utterances of a target speaker. Additionally, the authors of [1] demonstrate that the synthesis model can be fine-tuned on limited text and audio pairs of a new speaker to improve the speaker similarity of the cloned speech. Expressive Speech Synthesis: Prior works [31,25,24] on expressive speech synthesis focus on models that can be conditioned on text and a latent embedding for style or prosody. During training, the style embeddings are derived using a learnable module called Global Style Tokens (GST), that operates on the target speech for a given phrase and derives a style embedding through attention over a dictionary of learnable vectors.…”

Section: Background and Related Work

mentioning

confidence: 99%
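
The statement above describes a style embedding obtained "through attention over a dictionary of learnable vectors". As a rough illustration only, the following sketch computes single-head scaled dot-product attention between a reference encoding (the query) and a small bank of style tokens. It is a simplified assumption-laden toy, not the cited authors' implementation: the names `style_embedding`, `style_tokens`, and `reference_encoding`, the sizes, and the random initial values are all hypothetical, and the learned projections and training loop that a real model would need are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(reference_encoding, style_tokens):
    """Return (embedding, weights) from attention over a token dictionary.

    reference_encoding: query vector summarizing the reference speech
                        (hypothetical; a real system would learn an encoder).
    style_tokens:       list of vectors standing in for the learnable dictionary.
    """
    dim = len(reference_encoding)
    # Scaled dot-product score between the query and each token.
    scores = [
        sum(q * k for q, k in zip(reference_encoding, token)) / math.sqrt(dim)
        for token in style_tokens
    ]
    weights = softmax(scores)
    # The style embedding is the attention-weighted sum of the tokens.
    embedding = [
        sum(w * token[i] for w, token in zip(weights, style_tokens))
        for i in range(dim)
    ]
    return embedding, weights


# Hypothetical sizes and random values, for illustration only.
rng = random.Random(0)
NUM_TOKENS, DIM = 4, 8
style_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
reference_encoding = [rng.gauss(0, 1) for _ in range(DIM)]

embedding, weights = style_embedding(reference_encoding, style_tokens)
print("attention weights:", [round(w, 3) for w in weights])
```

Because the embedding is a convex combination of the tokens, the weights can also be set by hand at inference time instead of being derived from a reference, which is one way such a scheme could expose style control.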

“…Several past works have focused on the problem of expressive TTS synthesis by learning latent variables for controlling the style aspects of speech synthesized for a given text [31,24]. Such models are usually trained on a single-speaker expressive speech dataset to learn meaningful latent codes for various style aspects of the speech.…”

Section: Introduction

mentioning

confidence: 99%

“…At inference time, the model can be provided independent references for style and speaker encodings to achieve expressive voice cloning. Expressive Speech Synthesis: Prior works [31,25,24] on expressive speech synthesis focus on models that can be conditioned on text and a latent embedding for style or prosody. During training, the style embeddings are derived using a learnable module called Global Style Tokens (GST), that operates on the target speech for a given phrase and derives a style embedding through attention over a dictionary of learnable vectors.…”

mentioning

confidence: 99%

Expressive Neural Voice Cloning

Neekhara, Hussain, Dubnov et al. 2021

Preprint

Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack the ability to control the expressiveness of synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, pitch contour and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker. These cloning tasks include style transfer from a reference speech, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.

“…In [6,7], style tokens were used to model the prosody explicitly. Meanwhile, prosody can also be enriched during prosody transfer, like [8,9,10,11]. The prosody attributes of an entire utterance or segment were extracted with a single latent variable from a reference utterance or segment which were used to control the prosody of the synthesized speech.…”

Section: Introduction

mentioning

confidence: 99%

Speech Bert Embedding for Improving Prosody in Neural TTS

Chen, Deng, Wang et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper presents a speech BERT model to extract embedded prosody information in speech segments for improving the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, which can utilize more data than the original training data used by the target TTS. The embedding is extracted from the previous segment of a fixed length in the proposed BERT. The extracted embedding is then used together with the mel-spectrogram to predict the following segment in the TTS decoder. Experimental results obtained by the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody to improve the final prosody of the TTS speech. The objective distortions measured on a single speaker TTS are reduced between the generated speech and original recordings. Subjective listening tests also show that the proposed approach is favorably preferred over the TTS without the BERT prosody embedding module, for both in-domain and out-of-domain applications. For Microsoft professional, single/multiple speakers and the LJ Speaker in the public database, subjective preference is similarly confirmed with the new BERT prosody embedding. TTS demo audio samples are in https://judy44chen.github.io/TTSSpeechBERT/.

“…Address to the problem, multi-task speech synthetic methods based on reference audio feature embedding are proposed [69,84,82,86,4], which could synthesize speech with specified text, emotion and speaker identity. However, almost all of these methods need reference audio to synthesize the target speech.…”

Section: Introduction

mentioning

confidence: 99%

MASS: Multi-task Anthropomorphic Speech Synthesis Framework

Chen, Ming 2021

Preprint

Text-to-Speech (TTS) synthesis plays an important role in human-computer interaction. Currently, most TTS technologies focus on the naturalness of speech, namely, making the speeches sound like humans. However, the key tasks of the expression of emotion and the speaker identity are ignored, which limits the application scenarios of TTS synthesis technology. To make the synthesized speech more realistic and
