Interspeech 2018
DOI: 10.21437/interspeech.2018-1706

Improving Mongolian Phrase Break Prediction by Using Syllable and Morphological Embeddings with BiLSTM Model

Abstract: In speech synthesis systems, phrase break (PB) prediction is the first and most important step. Recently, state-of-the-art PB prediction systems have relied mainly on word embeddings. However, this method is not fully applicable to the Mongolian language, because its word embeddings are inadequately trained owing to the lack of resources. In this paper, we introduce a bidirectional Long Short-Term Memory (BiLSTM) model which combines word embeddings with syllable and morphological embedding representations to p…
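As a rough illustration of the model the abstract describes, the hedged sketch below concatenates word, syllable, and morphological embeddings and feeds them to a BiLSTM tagger that predicts a phrase-break label per word. It is not the authors' implementation: the framework (PyTorch), layer names, dimensions, and the simplification that each word has a single syllable and morphological index are all assumptions.

```python
# Hedged sketch (not the paper's code): a BiLSTM phrase-break (PB) tagger that
# combines word, syllable, and morphological embeddings, as the abstract describes.
import torch
import torch.nn as nn

class PBTagger(nn.Module):
    def __init__(self, n_words, n_syllables, n_morphs, n_labels=2,
                 word_dim=200, syl_dim=64, morph_dim=64, hidden=128):
        super().__init__()
        # Separate lookup tables for each representation level (dimensions assumed).
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.syl_emb = nn.Embedding(n_syllables, syl_dim)
        self.morph_emb = nn.Embedding(n_morphs, morph_dim)
        # BiLSTM over the concatenated embeddings.
        self.bilstm = nn.LSTM(word_dim + syl_dim + morph_dim, hidden,
                              batch_first=True, bidirectional=True)
        # Per-word phrase-break decision (e.g., break vs. non-break).
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, words, syllables, morphs):
        # All inputs: (batch, seq_len) index tensors, assumed aligned per word;
        # a real system would compose several syllables/morphemes per word first.
        x = torch.cat([self.word_emb(words),
                       self.syl_emb(syllables),
                       self.morph_emb(morphs)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)  # (batch, seq_len, n_labels) label scores
```

In practice a Mongolian word maps to several syllables and suffix morphemes, so the subword embeddings would typically be pooled or encoded before concatenation; the one-index-per-word alignment above only keeps the sketch short.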

Cited by 19 publications (7 citation statements). References 18 publications.
“…Second, the nature of the specific model. Unlike traditional RNN-based sequence labeling models [62,79,80] that capture sequential context, the self-attention sublayer in Figure 1 connects two arbitrary words directly regardless of their distance [53,58]. Furthermore, the recurrent sublayer in Figure 1 captures long-range sequential dependency well; therefore, the self-attention layer in Figure 1 does not rely on an output layer to model the phrase-break labeling sequence for decision making.…”
Section: Ablation Tests (mentioning)
confidence: 99%
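The excerpt above argues that a self-attention sublayer links any two words directly regardless of distance, while a recurrent sublayer captures long-range sequential dependency, so no structured output layer is needed. The minimal sketch below shows one way such a hybrid encoder could be stacked for per-word labeling; the PyTorch layers, residual combination, and dimensions are assumptions rather than the cited paper's implementation.

```python
# Hedged sketch of the hybrid encoder idea in the excerpt above: a self-attention
# sublayer plus a recurrent sublayer, with a plain per-token output layer (no CRF).
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, n_labels=2):
        super().__init__()
        # Self-attention connects any two positions directly, independent of distance.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Recurrent sublayer models long-range sequential dependency.
        self.rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(dim, n_labels)

    def forward(self, x):          # x: (batch, seq_len, dim) word representations
        a, _ = self.attn(x, x, x)  # direct pairwise interactions
        h, _ = self.rnn(x + a)     # residual combination feeding the recurrent sublayer
        return self.out(h)         # per-token phrase-break label scores
```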
“…We adopt a self-attention neural classifier, which handles long-range dependency of words better than an RNN [52]. This work is an extension of our previous work [62] with several novel contributions,…”
Section: Introduction (mentioning)
confidence: 99%
“…For Chinese, we use the Tencent AI Lab embedding database for Chinese Words and Phrases [41]. For Mongolian, the pre-trained 200-dimensional word embedding reported in [42] is used.…”
Section: Experiments, A. Databases (mentioning)
confidence: 99%
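The excerpt above only names the pre-trained embedding databases used. As a hedged illustration of how such vectors might be loaded into a tagger's word-embedding table, the sketch below reads a word2vec-style text file; the file format, function name, and paths are hypothetical, not the cited paper's setup.

```python
# Hedged sketch: initializing an embedding table from pre-trained vectors,
# e.g. a 200-dimensional Mongolian word embedding (format and paths assumed).
import numpy as np
import torch
import torch.nn as nn

def load_pretrained(path, vocab, dim=200):
    """Build an embedding matrix for `vocab` from a word2vec-style text file."""
    # Unknown words keep a small random vector.
    table = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                table[vocab[word]] = np.asarray(vec, dtype="float32")
    return nn.Embedding.from_pretrained(torch.from_numpy(table), freeze=False)

# Hypothetical usage:
# vocab = {"<unk>": 0, ...}  # word-to-index mapping built from the corpus
# word_emb = load_pretrained("mongolian_200d.vec", vocab)
```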
“…In these techniques, the key idea is to integrate the conventional TTS pipeline into a unified encoder-decoder network and to learn the mapping directly from the <text, wav> pair. Tacotron is a successful encoder-decoder implementation based on recurrent neural networks (RNN), such as LSTM [11,12] and GRU [13]. However, the recurrent nature inherently limits the possibility of parallel computing in both training and inference.…”
Section: Introduction (mentioning)
confidence: 99%