2020
DOI: 10.1109/lsp.2020.3016564
Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS

Abstract: Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To our best …
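The multi-task objective described in the abstract, jointly predicting the Mel spectrum and phrase breaks, can be sketched as a weighted sum of two losses. This is an illustrative reconstruction, not the paper's exact formulation: the `weight` hyperparameter, the mean-squared-error term for the spectrum, and the binary cross-entropy term for breaks are all assumptions.

```python
import numpy as np

def multitask_loss(mel_pred, mel_true, break_prob, break_true, weight=1.0):
    """Joint loss for a Tacotron-style multi-task setup (illustrative sketch).

    mel_pred / mel_true: predicted and reference Mel-spectrogram frames.
    break_prob: predicted probability of a phrase break at each position.
    break_true: binary reference phrase-break labels (1 = break).
    weight: hypothetical scalar balancing the two tasks.
    """
    # Reconstruction term: mean squared error over Mel-spectrogram frames.
    mel_loss = np.mean((mel_pred - mel_true) ** 2)

    # Phrase-break term: binary cross-entropy, with clipping for stability.
    eps = 1e-7
    p = np.clip(break_prob, eps, 1 - eps)
    break_loss = -np.mean(
        break_true * np.log(p) + (1 - break_true) * np.log(1 - p)
    )

    return mel_loss + weight * break_loss
```

In practice both terms would be backpropagated through a shared encoder, so that the break-prediction task regularizes the representations used for spectrum prediction.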


Cited by 18 publications (4 citation statements)
References 34 publications
“…The test of their proposed approach on native speakers reveals that it outperforms the baseline system trained without prosodic annotation. To enhance the prosodic phrasing of the Tacotron-based TTS model, [36] suggested a novel two-task learning scheme. For Chinese, they used TH-CoSS (TsingHua Corpus of Speech Synthesis).…”
Section: Related Work
confidence: 99%
“…With the advent of deep learning, end-to-end generative TTS models simplify the synthesis pipeline with a single neural network. Tacotron-based neural TTS [6,7] and its variants [8][9][10] are such examples. In these techniques, the key idea is to integrate the conventional TTS pipeline into a unified encoder-decoder network and to learn the mapping directly from the <text, wav> pair.…”
Section: Introduction
confidence: 99%
“…Recently, sequence-to-sequence learning based on the encoder-decoder model has been successfully applied to a wide range of tasks, including neural machine translation [5][6][7], automatic speech recognition [8][9][10], text-to-speech [11][12][13], phrase break prediction [14,15], and grapheme-to-phoneme conversion [16][17][18]. Such models learn a direct mapping between variable-length sequence pairs and transcribe graphemes to phonemes directly, without requiring a predefined alignment.…”
Section: Introduction
confidence: 99%