Mandarin Prosodic Phrase Prediction based on Syntactic Trees

Zhang, Zhengchen; Wu, Fangming; Yang, Chenyu; Dong, Minghui; Zhou, Fugen

doi:10.21437/ssw.2016-26

Cited by 10 publications

(4 citation statements)

References 11 publications


“…
Text Normalization: Rule-based [311], Neural-based [310,223,406,430], Hybrid [432]
Word Segmentation: [394,444,261]
POS Tagging: [292,323,221,444,135]
Prosody Prediction: [50,405,312,186,137,322,277,62,440,210,212,3]
Grapheme to Phoneme: N-gram [41,24], Neural-based [403,283,33,320]
Polyphone Disambiguation: [441,392,224,295,321,29,257]

and then neural networks are leveraged to model text normalization as a sequence to sequence task where the source and target sequences are non-standard words and spoken-form words respectively [310,223,430]. Recently, some works [432] propose to combine the advantages of both rule-based and neural-based models to further improve the performance of text normalization.…”

Section: Task Research Work (mentioning)

confidence: 99%

A Survey on Neural Speech Synthesis

Tan, Qin, Soong et al. 2021

Preprint


Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models, and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, open-source implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.


“…The text front-end structure of other languages is similar to that of Mandarin. These components are usually modeled by traditional statistical methods, such as syntactic trees [264] and CRF [167] based methods for PSP tasks and dictionary matching based methods [77] for pronunciation prediction tasks. However, these traditional text front-ends often fail to predict correctly in some unusual or complex contexts.…”

Section: Text Front-end (mentioning)

confidence: 99%

Review of end-to-end speech synthesis technology based on deep learning

Mu, Yang, Dong 2021

Preprint


As an indispensable part of modern human-computer interaction system, speech synthesis technology helps users get the output of intelligent machine more easily and intuitively, thus has attracted more and more attention. Due to the limitations of high complexity and low efficiency of traditional speech synthesis technology, the current research focus is the deep learning-based end-to-end speech synthesis technology, which has more powerful modeling ability and a simpler pipeline. It mainly consists of three modules: text front-end, acoustic model, and vocoder. This paper reviews the research status of these three parts, and classifies and compares various methods according to their emphasis. Moreover, this paper also summarizes the open-source speech corpus of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and objective speech quality evaluation methods. Finally, some attractive future research directions are pointed out.


“…Conventionally, linguistic information including lexical features (e.g., part-of-speech tags) and syntax features (e.g., distance from punctuation) is used for this task. Machine learning methods are used in phrasing models, such as decision tree algorithms [2][3][4][5][6][7][8], hidden Markov models [9][10][11], and conditional random fields [3,12]. Due to the development of natural language processing (NLP) and deep learning technologies, word representations have become the key linguistic feature.…”

Section: Introduction (mentioning)

confidence: 99%

Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

Yang, Koriyama, Yuki et al. 2023

Preprint


Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-speaker speech corpus. To this end, we propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus, injecting speaker embeddings to capture various speaker characteristics. We also leverage duration-aware pause insertion for more natural multi-speaker TTS. We develop and evaluate two types of models. The first improves conventional phrasing models on the position prediction of respiratory pauses (RPs), i.e., silent pauses at word transitions without punctuation. It performs speaker-conditioned RP prediction considering contextual information and is used to demonstrate the effect of speaker information on the prediction. The second model is further designed for phoneme-based TTS models and performs duration-aware pause insertion, predicting both RPs and punctuation-indicated pauses (PIPs) that are categorized by duration. The evaluation results show that our models improve the precision and recall of pause insertion and the rhythm of synthetic speech.
