ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054556

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

Abstract: Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignm…
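
The abstract describes conditioning synthesis on rhythm and a continuous pitch contour taken from a reference audio signal or a music score. As a minimal sketch of how such a frame-level pitch contour could be obtained, the snippet below extracts F0 with librosa's pyin tracker; the library choice, sample rate, hop length, and pitch range are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch: extract a continuous pitch (F0) contour from a reference
# recording, as one plausible source of the pitch conditioning signal the
# abstract describes. Library and parameter values are assumptions, not the
# Mellotron paper's exact setup.
import librosa
import numpy as np

def pitch_contour(wav_path, sr=22050, hop_length=256):
    """Return a frame-level F0 contour in Hz, with 0.0 on unvoiced frames."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz, assumed lower bound
        fmax=librosa.note_to_hz("C6"),   # ~1047 Hz, assumed upper bound
        sr=sr,
        hop_length=hop_length,
    )
    f0 = np.nan_to_num(f0)               # pyin marks unvoiced frames as NaN
    f0[~voiced_flag] = 0.0
    return f0                             # shape: (n_frames,)
```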

Cited by 99 publications (82 citation statements)
References 17 publications
“…Tacotron-GST [24] proposed modeling speech style using a global style token (GST) by adding a style token layer that consumes the reference encoder outputs [23] via a multi-head attention scheme [43]. Recently, Mellotron [25] combined GST, pitch, and rhythm for style transfer and significantly reduced the F0 frame error (FFE) between synthesized and reference audio.…”
Section: A. End-to-End DNN-Based TTS (mentioning)
confidence: 99%
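
The statement above credits Mellotron with reducing the F0 Frame Error (FFE) between synthesized and reference audio. FFE counts the fraction of frames with either a voicing decision error or a gross pitch error; the sketch below assumes frame-aligned F0 contours and the conventional 20% gross-error threshold. The function name and threshold default are illustrative, not taken from the cited papers.

```python
import numpy as np

def f0_frame_error(f0_ref, f0_syn, gross_error_ratio=0.2):
    """FFE: fraction of frames with a voicing decision error or a gross
    pitch error (>20% relative deviation on frames both tracks call voiced).

    f0_ref, f0_syn: frame-aligned F0 contours in Hz, 0.0 on unvoiced frames.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    assert f0_ref.shape == f0_syn.shape, "contours must be frame-aligned"

    voiced_ref = f0_ref > 0
    voiced_syn = f0_syn > 0

    # Voicing decision error: one track says voiced, the other unvoiced.
    vde = voiced_ref != voiced_syn

    # Gross pitch error: both voiced, but pitch deviates by too much.
    both_voiced = voiced_ref & voiced_syn
    gpe = np.zeros_like(vde)
    gpe[both_voiced] = (
        np.abs(f0_syn[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
        > gross_error_ratio
    )

    return float(np.mean(vde | gpe))
```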
“…The encoder-decoder network can also be called a spectrogram prediction network, which predicts the spectrogram output from the text input. The entire proposed multilingual multi-speaker TTS model, illustrated in Figure 1, is a sequence-to-sequence (seq-to-seq) Tacotron-2 network [13] with some additions: style embedding as in [24], pitch contour and attention map as in [25], language embedding, and speaker embedding. These additions handle multilingual and multi-speaker synthesis and the transfer of speaking style, pitch, and rhythm from a reference audio.…”
Section: A. Model Architectures (mentioning)
confidence: 99%
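
The statement above lists the conditioning signals added on top of a Tacotron-2 backbone: a style embedding, a pitch contour and attention map from a reference, plus language and speaker embeddings. The sketch below shows one plausible way the global (utterance-level) embeddings could be broadcast and concatenated onto the text-encoder outputs before decoding; the module names, dimensions, and PyTorch framing are assumptions, not the cited architecture. The frame-level pitch contour and attention map would typically be fed to the decoder per output frame and are omitted here.

```python
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    """Broadcast global style/speaker/language vectors over the text encoding."""

    def __init__(self, n_speakers, n_languages, enc_dim=512,
                 spk_dim=64, lang_dim=16, style_dim=128):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)
        self.language_emb = nn.Embedding(n_languages, lang_dim)
        # Project the concatenated features back to the decoder's expected size.
        self.proj = nn.Linear(enc_dim + spk_dim + lang_dim + style_dim, enc_dim)

    def forward(self, text_enc, speaker_id, language_id, style_emb):
        # text_enc:    (batch, T_text, enc_dim)  text-encoder outputs
        # speaker_id:  (batch,)                  integer speaker indices
        # language_id: (batch,)                  integer language indices
        # style_emb:   (batch, style_dim)        e.g. a GST-style embedding
        T = text_enc.size(1)
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, T, -1)
        lang = self.language_emb(language_id).unsqueeze(1).expand(-1, T, -1)
        sty = style_emb.unsqueeze(1).expand(-1, T, -1)
        cond = torch.cat([text_enc, spk, lang, sty], dim=-1)
        return self.proj(cond)  # (batch, T_text, enc_dim), consumed by the decoder
```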