ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054556

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

Abstract: Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignm…
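
The abstract describes conditioning synthesis on rhythm and a continuous pitch contour taken from a reference audio signal or a music score. As a minimal sketch of how such a frame-level pitch contour could be obtained, the snippet below extracts F0 with librosa's pyin tracker; the library choice, sample rate, hop length, and pitch range are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch: extract a continuous pitch (F0) contour from a reference
# recording, as one plausible source of the pitch conditioning signal the
# abstract describes. Library and parameter values are assumptions, not the
# Mellotron paper's exact setup.
import librosa
import numpy as np

def pitch_contour(wav_path, sr=22050, hop_length=256):
    """Return a frame-level F0 contour in Hz, with 0.0 on unvoiced frames."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz, assumed lower bound
        fmax=librosa.note_to_hz("C6"),   # ~1047 Hz, assumed upper bound
        sr=sr,
        hop_length=hop_length,
    )
    f0 = np.nan_to_num(f0)               # pyin marks unvoiced frames as NaN
    f0[~voiced_flag] = 0.0
    return f0                             # shape: (n_frames,)
```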

Cited by 99 publications (82 citation statements)
References 17 publications
“…Tacotron-GST [24] proposed modeling speech style using a global style token (GST) by adding a style token layer that consumes the reference encoder outputs [23] via a multi-head attention scheme [43]. Recently, Mellotron [25] combined GST, pitch, and rhythm for style transfer and significantly reduced the F0 frame error (FFE) between synthesized and reference audio.…”
Section: A. End-to-End DNN-Based TTS (mentioning)
confidence: 99%
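
The statement above credits Mellotron with reducing the F0 Frame Error (FFE) between synthesized and reference audio. FFE counts the fraction of frames with either a voicing decision error or a gross pitch error; the sketch below assumes frame-aligned F0 contours and the conventional 20% gross-error threshold. The function name and threshold default are illustrative, not taken from the cited papers.

```python
import numpy as np

def f0_frame_error(f0_ref, f0_syn, gross_error_ratio=0.2):
    """FFE: fraction of frames with a voicing decision error or a gross
    pitch error (>20% relative deviation on frames both tracks call voiced).

    f0_ref, f0_syn: frame-aligned F0 contours in Hz, 0.0 on unvoiced frames.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    assert f0_ref.shape == f0_syn.shape, "contours must be frame-aligned"

    voiced_ref = f0_ref > 0
    voiced_syn = f0_syn > 0

    # Voicing decision error: one track says voiced, the other unvoiced.
    vde = voiced_ref != voiced_syn

    # Gross pitch error: both voiced, but pitch deviates by too much.
    both_voiced = voiced_ref & voiced_syn
    gpe = np.zeros_like(vde)
    gpe[both_voiced] = (
        np.abs(f0_syn[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
        > gross_error_ratio
    )

    return float(np.mean(vde | gpe))
```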
“…The encoder-decoder network can also be called a spectrogram prediction network, which predicts the spectrogram output from the text input. The entire proposed multilingual multi-speaker TTS model, illustrated in Figure 1, is a sequence-to-sequence (seq-to-seq) Tacotron-2 network [13] with some additions: style embedding as in [24], pitch contour and attention map as in [25], language embedding, and speaker embedding. These additions handle multilingual and multi-speaker synthesis and the transfer of speaking style, pitch, and rhythm from a reference audio.…”
Section: A. Model Architectures (mentioning)
confidence: 99%
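
The statement above lists the conditioning signals added on top of a Tacotron-2 backbone: a style embedding, a pitch contour and attention map from a reference, plus language and speaker embeddings. The sketch below shows one plausible way the global (utterance-level) embeddings could be broadcast and concatenated onto the text-encoder outputs before decoding; the module names, dimensions, and PyTorch framing are assumptions, not the cited architecture. The frame-level pitch contour and attention map would typically be fed to the decoder per output frame and are omitted here.

```python
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    """Broadcast global style/speaker/language vectors over the text encoding."""

    def __init__(self, n_speakers, n_languages, enc_dim=512,
                 spk_dim=64, lang_dim=16, style_dim=128):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)
        self.language_emb = nn.Embedding(n_languages, lang_dim)
        # Project the concatenated features back to the decoder's expected size.
        self.proj = nn.Linear(enc_dim + spk_dim + lang_dim + style_dim, enc_dim)

    def forward(self, text_enc, speaker_id, language_id, style_emb):
        # text_enc:    (batch, T_text, enc_dim)  text-encoder outputs
        # speaker_id:  (batch,)                  integer speaker indices
        # language_id: (batch,)                  integer language indices
        # style_emb:   (batch, style_dim)        e.g. a GST-style embedding
        T = text_enc.size(1)
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, T, -1)
        lang = self.language_emb(language_id).unsqueeze(1).expand(-1, T, -1)
        sty = style_emb.unsqueeze(1).expand(-1, T, -1)
        cond = torch.cat([text_enc, spk, lang, sty], dim=-1)
        return self.proj(cond)  # (batch, T_text, enc_dim), consumed by the decoder
```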