ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683862

Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis

Abstract: Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality text-audio pairs for training, which are expensive to collect. In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron. The idea is to allow Tacotron to utilize textual and acoustic knowledge contained in large, publicly available text and speech corpora. Importantly, these external data are unpaired and potential…
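The conditioning details are in the paper itself; as a rough illustration of the two-stage idea the abstract describes (pre-train on unpaired data, then fine-tune on a small paired set), here is a minimal PyTorch-style sketch. Everything in it, from the TinyTacotron class to the layer sizes and the use_text flag, is a hypothetical stand-in, not the paper's actual architecture.

import torch
import torch.nn as nn

class TinyTacotron(nn.Module):
    # Toy stand-in for a Tacotron-like model: a text encoder plus an
    # autoregressive mel decoder conditioned on a single context vector.
    def __init__(self, vocab=64, emb=128, mel=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.enc_rnn = nn.GRU(emb, emb, batch_first=True)
        self.decoder = nn.GRU(mel + emb, emb, batch_first=True)
        self.proj = nn.Linear(emb, mel)

    def forward(self, text, mels, use_text=True):
        enc_out, _ = self.enc_rnn(self.embed(text))
        ctx = enc_out.mean(dim=1, keepdim=True)   # crude context vector
        if not use_text:
            # Unpaired-speech case: decoder pre-training with the
            # textual conditioning zeroed out.
            ctx = torch.zeros_like(ctx)
        # Teacher forcing: predict frame t from frame t-1 plus the context.
        prev = torch.cat([torch.zeros_like(mels[:, :1]), mels[:, :-1]], dim=1)
        dec_in = torch.cat([prev, ctx.expand(-1, prev.size(1), -1)], dim=2)
        out, _ = self.decoder(dec_in)
        return self.proj(out)

model = TinyTacotron()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Stage 1: pre-train the decoder on untranscribed speech (random tensors
# stand in for real mel-spectrograms here).
for mels in [torch.randn(4, 50, 80) for _ in range(3)]:
    dummy_text = torch.zeros(4, 10, dtype=torch.long)
    loss = mse(model(dummy_text, mels, use_text=False), mels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: fine-tune on the small paired set with real text conditioning.
text, mels = torch.randint(0, 64, (4, 10)), torch.randn(4, 50, 80)
loss = mse(model(text, mels, use_text=True), mels)
opt.zero_grad()
loss.backward()
opt.step()

The textual half of the framework (leveraging external text corpora to help the encoder) is omitted here for brevity.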

Cited by 100 publications (97 citation statements: 0 supporting, 97 mentioning, 0 contrasting; citing years 2019–2023). References 12 publications.
“…However, training end-to-end TTS systems requires large quantities of text-audio paired data. To improve data efficiency, a semi-supervised training framework was proposed for Tacotron [1] that leverages non-parallel, large-scale text and speech resources [12]. Nevertheless, there is little discussion of end-to-end TTS for low-resource languages, where only very limited paired data are available.…”
Section: Introduction (mentioning, confidence: 99%)
“…Recently, end-to-end TTS systems trained in an autoregressive manner, such as Tacotron [8] and Deep Voice [9], have shown better performance than conventional methods. In addition, various follow-up studies have added controllable elements such as prosody and style [10,11], or proposed models that can be trained more efficiently [7,12]. We conducted our study by modifying a TTS model to suit the SVS task, based on DCTTS [7], which is known to be capable of efficient end-to-end training.…”
Section: Related Work (mentioning, confidence: 99%)
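The "autoregressive manner" in this excerpt means synthesis feeds each predicted frame back as input for the next step. A minimal sketch of such a decoding loop, assuming a hypothetical model interface with encode and decode_step methods (not a real Tacotron or DCTTS API):

import torch

@torch.no_grad()
def autoregressive_synthesis(model, text, max_frames=200):
    # Greedy frame-by-frame decoding: each predicted mel frame is fed
    # back as the input for the next step. decode_step is assumed to
    # return (next_frame, new_state, scalar stop probability).
    ctx = model.encode(text)           # assumed text-encoding hook
    frame = torch.zeros(1, 80)         # all-zero "go" frame
    state, frames = None, []
    for _ in range(max_frames):
        frame, state, stop = model.decode_step(frame, state, ctx)
        frames.append(frame)
        if stop > 0.5:                 # assumed learned stop token
            break
    return torch.stack(frames, dim=1)  # (1, T, 80) mel-spectrogram

This sequential dependency is why training typically uses teacher forcing (feeding ground-truth frames), and it is also what makes decoder pre-training on speech alone possible: the frame-to-frame predictions do not strictly require text.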
“…Another approach is to pre-train only the decoder by simply removing the encoder [17,19]. This is equivalent to zeroing out the context vector [12], which introduces a mismatch as discussed in Section 2.…”
Section: Comparison With Related Work (mentioning, confidence: 99%)
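The equivalence claimed above is easy to verify numerically, assuming the decoder consumes the context through a linear input layer (an illustrative assumption; dimensions here are arbitrary): with the context slot zeroed, the layer computes exactly what a context-free decoder input layer would.

import torch
import torch.nn as nn

mel_dim, ctx_dim = 80, 128
W = nn.Linear(mel_dim + ctx_dim, 256)  # decoder input layer
x = torch.randn(1, mel_dim)            # previous mel frame

# (a) Encoder kept, but context vector zeroed out.
y_zero_ctx = W(torch.cat([x, torch.zeros(1, ctx_dim)], dim=1))

# (b) Encoder removed: only the mel columns of the same weights act on x.
y_no_enc = x @ W.weight[:, :mel_dim].t() + W.bias

assert torch.allclose(y_zero_ctx, y_no_enc, atol=1e-6)

The mismatch mentioned in the excerpt follows directly: a decoder pre-trained only on all-zero contexts has never seen the non-zero encoder outputs it receives once fine-tuning begins.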