Feilong Bao scite author profile

Feilong Bao

5Publications

64Citation Statements Received

118Citation Statements Given

How they've been cited

144

How they cite others

134

117

Affiliations

Inner Mongolia University, National University of Mongolia, Kunming University of Science and Technology

Publications

Order By: Most citations

Teacher-Student Training For Robust Tacotron-Based TTS

Liu

Şişman

et al. 2020

View full text Add to dashboard Cite

While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways, the exposure bias problem in the autoregressive models remains an issue to be resolved. The exposure bias problem arises from the mismatch between the training and inference process, that results in unpredictable performance for out-of-domain test data at run-time. To overcome this, we propose a teacher-student training scheme for Tacotron-based TTS by introducing a distillation loss function in addition to the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder, that serves as a teacher model. We then train another Tacotron2-based model as a student model, of which the decoder takes the predicted speech frames as input, similar to how the decoder works during run-time inference. With the distillation loss, the student model learns the output probabilities from the teacher model, that is called knowledge distillation. Experiments show that our proposed training scheme consistently improves the voice quality for out-ofdomain test data both in Chinese and English systems.

show abstract

Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Hao

Wen³

et al. 2020

View full text Add to dashboard Cite

Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis

Liu

Şişman

Bao

et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

View full text Add to dashboard Cite

Prosodic phrasing is an important factor that affects naturalness and intelligibility in text-to-speech synthesis. Studies show that deep learning techniques improve prosodic phrasing when large text and speech corpus are available. However, for low-resource languages, such as Mongolian, prosodic phrasing remains a challenge for various reasons. First, the database suitable for system training is limited; Second, word composition knowledge that is prosody-informing has not been used in prosodic phrase modeling. To address these problems, in this paper, we propose a feature augmentation method in conjunction with a self-attention neural classifier. We augment input text with morphological and phonological decompositions of words to enhance the text encoder. We study the use of self-attention classifier, that makes use of global context of a sentence, as a decoder for phrase break prediction. Both objective and subjective evaluations validate the effectiveness of the proposed phrase break prediction framework, that consistently improves voice quality in a Mongolian text-to-speech synthesis system.

show abstract

Mongolian Text-to-Speech System Based on Deep Neural Network

Liu

Bao

Gao

et al. 2018

View full text Add to dashboard Cite

Improving Mongolian Phrase Break Prediction by Using Syllable and Morphological Embeddings with BiLSTM Model

Liu

Bao

Gao

et al. 2018

View full text Add to dashboard Cite

In the speech synthesis systems, the phrase break (PB) prediction is the first and most important step. Recently, the state-of-the-art PB prediction systems mainly rely on word embeddings. However this method is not fully applicable to Mongolian language, because its word embeddings are inadequate trained, owing to the lack of resources. In this paper, we introduce a bidirectional Long Short Term Memory (BiLSTM) model which combined word embeddings with syllable and morphological embedding representations to provide richer and multi-view information which leverages the agglutinative property. Experimental results show the proposed method outperforms compared systems which only used the word embeddings. In addition, further analysis shows that it is quite robust to the Out-of-Vocabulary (OOV) problem owe to the refined word embedding. The proposed method achieves the state-of-the-art performance in the Mongolian PB prediction.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Feilong Bao

Teacher-Student Training For Robust Tacotron-Based TTS

Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis

Mongolian Text-to-Speech System Based on Deep Neural Network

Improving Mongolian Phrase Break Prediction by Using Syllable and Morphological Embeddings with BiLSTM Model

Contact Info

Product

Resources

About