Guanglai Gao scite author profile

While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in autoregressive models remains unresolved. Exposure bias arises from the mismatch between the training and inference processes, which results in unpredictable performance on out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS that introduces a distillation loss function in addition to the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher. We then train another Tacotron2-based model as the student, whose decoder takes the predicted speech frames as input, similar to how the decoder works during run-time inference. With the distillation loss, the student model learns the output probabilities from the teacher model, a process called knowledge distillation. Experiments show that our proposed training scheme consistently improves the voice quality for out-of-domain test data in both Chinese and English systems.
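The teacher-student scheme in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "frames", the `decode` helper, the two step functions, and the use of squared error for both loss terms are assumptions made only to show how teacher forcing differs from free-running decoding and how a distillation term could be added to a feature loss.

```python
def decode(step, targets, teacher_forcing):
    """Run a toy autoregressive decoder for len(targets) frames."""
    outputs, prev = [], 0.0  # 0.0 stands in for the initial "go" frame
    for t in range(len(targets)):
        out = step(prev)
        outputs.append(out)
        # Teacher forcing feeds the natural frame; free running feeds the prediction.
        prev = targets[t] if teacher_forcing else out
    return outputs


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(student_out, teacher_out, targets, weight=1.0):
    """Feature loss against natural frames plus a weighted distillation term."""
    feature_loss = mse(student_out, targets)
    distill_loss = mse(student_out, teacher_out)  # assumed form of the distillation loss
    return feature_loss + weight * distill_loss


def teacher_step(prev):
    # Hypothetical teacher decoder step.
    return prev + 1.0


def student_step(prev):
    # Hypothetical, slightly inaccurate student decoder step.
    return prev + 0.9


targets = [1.0, 2.0, 3.0]
teacher_out = decode(teacher_step, targets, teacher_forcing=True)
student_out = decode(student_step, targets, teacher_forcing=False)
loss = student_loss(student_out, teacher_out, targets)
```

In this sketch the student's errors accumulate across frames because it consumes its own predictions, which is the run-time condition the abstract says the student is trained under.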

Fractal property of generalized M-set with rational number exponent

Liu, Cheng, Lan, et al. 2013

Applied Mathematics and Computation

Expressive TTS Training with Frame and Style Reconstruction Loss

Liu, Şişman, Gao, et al. 2020

Preprint

We propose a novel training strategy for Tacotron-based text-to-speech (TTS) systems to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations from training data. Nor does it attempt to model prosody explicitly; rather, it encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) a frame-level reconstruction loss, which is calculated between the synthesized and target spectral features; 2) an utterance-level style reconstruction loss, which is calculated between the deep style features of synthesized and target speech. The proposed style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness.
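The two-part objective described in this abstract can be sketched as follows. This is an illustrative assumption, not the paper's code: `style_features` is a hypothetical stand-in (a per-dimension mean over the utterance) for the deep style extractor, and squared error and the `style_weight` parameter are assumed choices.

```python
def frame_loss(synth, target):
    """Frame-level loss: mean squared error over every spectral value."""
    total, count = 0.0, 0
    for s_frame, t_frame in zip(synth, target):
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def style_features(frames):
    """Hypothetical style extractor: per-dimension mean across the utterance."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def style_loss(synth, target):
    """Utterance-level loss: distance between style features of both utterances."""
    fs, ft = style_features(synth), style_features(target)
    return sum((a - b) ** 2 for a, b in zip(fs, ft)) / len(fs)


def total_loss(synth, target, style_weight=1.0):
    """Combine the frame-level and utterance-level terms."""
    return frame_loss(synth, target) + style_weight * style_loss(synth, target)


# Two frames with two spectral dimensions each (toy values).
target = [[0.0, 1.0], [2.0, 3.0]]
synth = [[0.0, 1.0], [2.0, 5.0]]
combined = total_loss(synth, target)
```

The frame term compares the utterances value by value, while the style term compares one summary vector per utterance, which is the distinction the abstract draws between the two losses.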

Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis

Liu, Şişman, Bao, et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

Prosodic phrasing is an important factor that affects naturalness and intelligibility in text-to-speech synthesis. Studies show that deep learning techniques improve prosodic phrasing when large text and speech corpora are available. However, for low-resource languages such as Mongolian, prosodic phrasing remains a challenge for two reasons. First, the data suitable for system training is limited. Second, word composition knowledge that informs prosody has not been used in prosodic phrase modeling. To address these problems, we propose a feature augmentation method in conjunction with a self-attention neural classifier. We augment the input text with morphological and phonological decompositions of words to enhance the text encoder. We study the use of a self-attention classifier, which makes use of the global context of a sentence, as a decoder for phrase break prediction. Both objective and subjective evaluations validate the effectiveness of the proposed phrase break prediction framework, which consistently improves voice quality in a Mongolian text-to-speech synthesis system.
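A minimal sketch of the idea in this abstract, under stated assumptions: the feature vectors, the `augment` helper, and the classifier weights are all hypothetical, and the attention shown here is a single head without learned query, key, or value projections, which is a simplification rather than the paper's architecture.

```python
import math


def augment(word_vec, morph_vec, phon_vec):
    """Concatenate word, morphological, and phonological features for one word."""
    return word_vec + morph_vec + phon_vec


def self_attention(vectors):
    """Scaled dot-product self-attention where queries, keys, and values coincide."""
    dim = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs


def break_probabilities(contextual, weights, bias=0.0):
    """Logistic classifier giving a phrase-break probability after each word."""
    probs = []
    for vec in contextual:
        logit = sum(w * x for w, x in zip(weights, vec)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Three words, each with toy word, morphological, and phonological features.
sentence = [
    augment([0.1, 0.3], [1.0], [0.0]),
    augment([0.5, 0.2], [0.0], [1.0]),
    augment([0.4, 0.9], [1.0], [1.0]),
]
contextual = self_attention(sentence)
probs = break_probabilities(contextual, weights=[0.5, -0.2, 0.8, 0.1])
```

Because every word attends to every other word, each break decision can depend on the whole sentence, which is the "global context" property the abstract attributes to the self-attention classifier.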

Training Supervised Speech Separation System to Improve STOI and PESQ Directly

Teacher-Student Training For Robust Tacotron-Based TTS
