ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683862

Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis

Abstract: Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality text-audio pairs for training, which are expensive to collect. In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron. The idea is to allow Tacotron to utilize textual and acoustic knowledge contained in large, publicly available text and speech corpora. Importantly, these external data are unpaired and potential…
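The conditioning details are in the paper itself; as a rough illustration of the two-stage idea the abstract describes (pre-train on unpaired data, then fine-tune on a small paired set), here is a minimal PyTorch-style sketch. Everything in it, from the TinyTacotron class to the layer sizes and the use_text flag, is a hypothetical stand-in, not the paper's actual architecture.

import torch
import torch.nn as nn

class TinyTacotron(nn.Module):
    # Toy stand-in for a Tacotron-like model: a text encoder plus an
    # autoregressive mel decoder conditioned on a single context vector.
    def __init__(self, vocab=64, emb=128, mel=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.enc_rnn = nn.GRU(emb, emb, batch_first=True)
        self.decoder = nn.GRU(mel + emb, emb, batch_first=True)
        self.proj = nn.Linear(emb, mel)

    def forward(self, text, mels, use_text=True):
        enc_out, _ = self.enc_rnn(self.embed(text))
        ctx = enc_out.mean(dim=1, keepdim=True)   # crude context vector
        if not use_text:
            # Unpaired-speech case: decoder pre-training with the
            # textual conditioning zeroed out.
            ctx = torch.zeros_like(ctx)
        # Teacher forcing: predict frame t from frame t-1 plus the context.
        prev = torch.cat([torch.zeros_like(mels[:, :1]), mels[:, :-1]], dim=1)
        dec_in = torch.cat([prev, ctx.expand(-1, prev.size(1), -1)], dim=2)
        out, _ = self.decoder(dec_in)
        return self.proj(out)

model = TinyTacotron()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Stage 1: pre-train the decoder on untranscribed speech (random tensors
# stand in for real mel-spectrograms here).
for mels in [torch.randn(4, 50, 80) for _ in range(3)]:
    dummy_text = torch.zeros(4, 10, dtype=torch.long)
    loss = mse(model(dummy_text, mels, use_text=False), mels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: fine-tune on the small paired set with real text conditioning.
text, mels = torch.randint(0, 64, (4, 10)), torch.randn(4, 50, 80)
loss = mse(model(text, mels, use_text=True), mels)
opt.zero_grad()
loss.backward()
opt.step()

The textual half of the framework (leveraging external text corpora to help the encoder) is omitted here for brevity.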

Cited by 100 publications (97 citation statements: 0 supporting, 97 mentioning, 0 contrasting; citing years 2019–2023). References 12 publications.
“…However, training end-to-end TTS systems requires large quantities of text-audio paired data. To improve data efficiency, a semi-supervised training framework was proposed for Tacotron [1] that leverages non-parallel, large-scale text and speech resources [12]. Nevertheless, there is little discussion of end-to-end TTS for low-resource languages, where only very limited paired data are available.…”
Section: Introduction (mentioning, confidence: 99%)
“…Recently, end-to-end TTS systems trained in an autoregressive manner, such as Tacotron [8] and Deep Voice [9], have shown better performance than conventional methods. In addition, various follow-up studies have added controllable elements such as prosody and style [10,11], or proposed models that can be trained more efficiently [7,12]. We conducted our study by modifying a TTS model to suit the SVS task, based on DCTTS [7], which is known to be capable of efficient end-to-end training.…”
Section: Related Work (mentioning, confidence: 99%)
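The "autoregressive manner" in this excerpt means synthesis feeds each predicted frame back as input for the next step. A minimal sketch of such a decoding loop, assuming a hypothetical model interface with encode and decode_step methods (not a real Tacotron or DCTTS API):

import torch

@torch.no_grad()
def autoregressive_synthesis(model, text, max_frames=200):
    # Greedy frame-by-frame decoding: each predicted mel frame is fed
    # back as the input for the next step. decode_step is assumed to
    # return (next_frame, new_state, scalar stop probability).
    ctx = model.encode(text)           # assumed text-encoding hook
    frame = torch.zeros(1, 80)         # all-zero "go" frame
    state, frames = None, []
    for _ in range(max_frames):
        frame, state, stop = model.decode_step(frame, state, ctx)
        frames.append(frame)
        if stop > 0.5:                 # assumed learned stop token
            break
    return torch.stack(frames, dim=1)  # (1, T, 80) mel-spectrogram

This sequential dependency is why training typically uses teacher forcing (feeding ground-truth frames), and it is also what makes decoder pre-training on speech alone possible: the frame-to-frame predictions do not strictly require text.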
“…Another approach is to pre-train only the decoder by simply removing the encoder [17,19]. This is equivalent to zeroing out the context vector [12], which introduces a mismatch as discussed in Section 2.…”
Section: Comparison With Related Work (mentioning, confidence: 99%)
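The equivalence claimed above is easy to verify numerically, assuming the decoder consumes the context through a linear input layer (an illustrative assumption; dimensions here are arbitrary): with the context slot zeroed, the layer computes exactly what a context-free decoder input layer would.

import torch
import torch.nn as nn

mel_dim, ctx_dim = 80, 128
W = nn.Linear(mel_dim + ctx_dim, 256)  # decoder input layer
x = torch.randn(1, mel_dim)            # previous mel frame

# (a) Encoder kept, but context vector zeroed out.
y_zero_ctx = W(torch.cat([x, torch.zeros(1, ctx_dim)], dim=1))

# (b) Encoder removed: only the mel columns of the same weights act on x.
y_no_enc = x @ W.weight[:, :mel_dim].t() + W.bias

assert torch.allclose(y_zero_ctx, y_no_enc, atol=1e-6)

The mismatch mentioned in the excerpt follows directly: a decoder pre-trained only on all-zero contexts has never seen the non-zero encoder outputs it receives once fine-tuning begins.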