Efficient text analyser with prosody generator-driven approach for Mandarin text-to-speech

J AUDIO SPEECH MUSIC PROC.

Hwang

2012

In this study, a consistency analysis of the energy parameter for Mandarin speech is presented. Identified by inspecting the human pronunciation process, the consistency can be interpreted as a high correlation between the warping curves of the spectrum and the prosody within a syllable. The consistency analysis proceeds in three steps. First, the hidden Markov model (HMM) algorithm is used to decode the HMM-state sequence within a syllable and, at the same time, to divide it into three segments. Second, for a designated syllable, vector quantization (VQ) with the Linde-Buzo-Gray algorithm is used to train a VQ codebook for each segment. Third, the energy vector of each segment is encoded as an index by the VQ codebooks, and the probability of each possible path is then evaluated as a prerequisite for analyzing the consistency. The experiments demonstrate that consistency is obtained when the syllable is located in exactly the same word. These results suggest that the energy warping process within a syllable must be considered in a text-to-speech system to improve synthesized speech quality.
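The codebook training in the second step can be illustrated with a minimal pure-Python sketch of Linde-Buzo-Gray splitting. This is not the authors' implementation: the function names (`nearest`, `centroid`, `train_lbg`), the additive split offset `eps`, the fixed iteration count, and the sample energy vectors are all assumptions made for illustration.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train_lbg(vectors, size, eps=0.01, iters=20):
    """Grow a codebook by repeated splitting until it holds `size` codewords.

    `size` is assumed to be a power of two, since each split doubles the codebook.
    """
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into two slightly shifted copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(iters):
            # Assign each training vector to its nearest codeword ...
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(codebook, v)].append(v)
            # ... then move each codeword to the mean of its cell
            # (an empty cell keeps its old codeword).
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook


# Made-up energy vectors for one segment of one syllable (illustration only).
energy_vectors = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.1], [5.0, 5.1, 4.9], [5.1, 4.9, 5.0]]
codebook = train_lbg(energy_vectors, size=2)
# Encoding step: each vector is replaced by the index of its nearest codeword.
indices = [nearest(codebook, v) for v in energy_vectors]
```

A fixed number of refinement passes is used here for brevity; LBG is more commonly stopped when the average distortion no longer improves by more than a threshold.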

Section: Introduction (mentioning)

confidence: 99%


A study on the consistency analysis of energy parameter for Mandarin speech

Shen

J AUDIO SPEECH MUSIC PROC.

Hwang

2012

“…Thus, the suitable linguistic feature driven by the performance of prosody generator is the best policy. In other words, the best linguistic features used to generate the best prosodic information must be inspected and determined by the performance of the prosody generator [6].…”

Section: Introduction (mentioning)

confidence: 99%

“…The block diagram of a general TTS system is shown in Fig. 1. In the past, much effort was paid to design a TTS with high quality [1]-[6]. However, the naturalness and fluency are two important issues for the TTS system.…”

mentioning

confidence: 99%


The research and implementation of acoustic module based Mandarin TTS

2010 4th International Symposium on Communications, Control and Signal Processing (ISCCSP)

Chen

2010

This paper focuses on the design of the acoustic module (AM) in order to improve the performance of a Mandarin TTS system. The AM is composed of the prosody generator, the spectrum generator, and the speech synthesizer. The HMM, recurrent neural network (RNN), and PSOLA algorithms are employed to build the AM. Finally, the speech quality, memory requirement, and computational complexity of the system are examined. The system requires less than 2.4 MB of memory and an average of 0.08 MIPS per syllable on a fixed-point DSP chip, and the synthesized speech sounds very good.

I. INTRODUCTION

A general TTS system includes text analysis, a prosody generator, a synthesis unit generator, and a speech synthesizer. Text analysis is first invoked to parse the input text syntactically and/or semantically and to extract linguistic features. Text analysis usually needs a lexical dictionary established by linguists. The prosody generator receives the linguistic features and generates prosodic information such as the pitch contour, energy envelope, and duration patterns. The naturalness of the synthesized speech is determined by the prosody generator. The synthesis unit generator produces the most suitable speech template according to the phonetic symbol; its main goal is to make the synthesized speech clearer. Finally, the speech synthesizer takes the prosodic information and the synthesis units, applies a prosodic modification algorithm to the synthesis units, and outputs natural speech. The block diagram of a general TTS system is shown in Fig. 1.

In the past, much effort was devoted to designing high-quality TTS systems [1]-[6]. However, naturalness and fluency are two important issues for a TTS system. Thus, most researchers have focused their effort on the prosody generator. In a general prosody generator, two problems must be overcome to achieve natural and fluent speech. 
One is the choice of a suitable model for the prosody generator, and the other is the choice of suitable linguistic features for it. For the first problem, two approaches, rule-based and statistical, have been employed to generate suitable prosodic information. The rule-based approach [1], [2] used many pronunciation rules inferred by linguists to improve the speech quality of the TTS system. Deriving the pronunciation rules, however, is laborious, time-consuming, and tedious. Furthermore, the cross-influence of pronunciation rules on the prosodic information cannot be easily quantified and expressed as independent rules. Moreover, these pronunciation rules must be inferred by acoustic experts and linguists. The statistical approach [3], [4], by contrast, used a probability model or a neural network to automatically organize and infer the pronunciation rules. The natu...

Fig. 1. Block diagram of a general TTS system (input text to synthesized speech).

Cheng-Yu Yeh is with the ... 978-1-4244-6287-2/10/$26.00 ©2010 IEEE
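The four-stage data flow described in the introduction above (text analysis, prosody generator, synthesis unit generator, speech synthesizer) can be sketched as a chain of functions. Everything inside the stages is a placeholder: the function names, the `Prosody` fields, the constant prosody values, and the one-character-per-syllable simplification are assumptions for illustration, not the system described in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prosody:
    pitch: List[float]     # pitch contour, one placeholder value per syllable
    energy: List[float]    # energy envelope, one placeholder value per syllable
    duration: List[float]  # duration pattern in seconds, one value per syllable


def text_analysis(text: str) -> List[str]:
    # Placeholder: a real analyser consults a lexical dictionary.
    # Here each non-space character is treated as one syllable.
    return [ch for ch in text if not ch.isspace()]


def prosody_generator(features: List[str]) -> Prosody:
    # Placeholder constants standing in for a trained prosody model.
    n = len(features)
    return Prosody(pitch=[200.0] * n, energy=[1.0] * n, duration=[0.25] * n)


def synthesis_unit_generator(features: List[str]) -> List[str]:
    # Placeholder: one named speech template per syllable.
    return [f"unit<{f}>" for f in features]


def speech_synthesizer(units: List[str], prosody: Prosody) -> List[Tuple[str, float, float, float]]:
    # Placeholder: pair each unit with its prosodic targets
    # instead of modifying and concatenating waveforms.
    return list(zip(units, prosody.pitch, prosody.energy, prosody.duration))


def tts(text: str) -> List[Tuple[str, float, float, float]]:
    features = text_analysis(text)
    units = synthesis_unit_generator(features)
    prosody = prosody_generator(features)
    return speech_synthesizer(units, prosody)


plan = tts("你好")
```

The sketch only makes the interfaces explicit: the prosody generator and the synthesis unit generator both consume the output of text analysis, and the speech synthesizer combines their results.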

Consistency analysis of the spectrum and prosody within a syllable for Mandarin speech

Chen

Math Methods in App Sciences

Hwang

et al. 2013

This work presents a study of Mandarin speech focusing on consistency analysis of the spectrum and prosody within syllables. Identified by inspecting the human pronunciation process, this consistency can be interpreted as a high correlation between the warping curves of the spectrum and the prosody within a syllable. The consistency analysis consisted of three steps. First, the hidden Markov model (HMM) algorithm was used to decode the HMM‐state sequence within a syllable while, at the same time, dividing it into three segments. Second, for a designated syllable, vector quantization (VQ) with the Linde–Buzo–Gray algorithm was employed to train the VQ codebooks of the prosodic vector of each segment. Third, the prosodic vector of each segment was encoded as an index using the VQ codebooks, and the probability of each possible path was then evaluated as a prerequisite for analyzing the consistency. Finally, two syllables were used as examples to verify the consistency property found in the experiments. The experiments demonstrate that consistency holds when the syllable is located in exactly the same word. These results suggest that the warping process between the spectrum and the prosody within a syllable must be considered in text‐to‐speech systems to improve synthesized speech quality. Copyright © 2013 John Wiley & Sons, Ltd.
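The path evaluation in the third step can be illustrated with a short sketch. It assumes that each syllable token yields one path, namely the tuple of VQ indices of its three segments, and that a path's probability is estimated as its relative frequency over all tokens. The abstract does not give the estimator, so this choice, the function name `path_probabilities`, and the sample paths are all assumptions.

```python
from collections import Counter


def path_probabilities(paths):
    """Relative frequency of each index path (one tuple of VQ indices per token)."""
    counts = Counter(paths)
    total = len(paths)
    return {path: n / total for path, n in counts.items()}


# Made-up paths: (segment 1 index, segment 2 index, segment 3 index) per token.
paths = [(0, 1, 1), (0, 1, 1), (0, 1, 1), (1, 0, 1)]
probs = path_probabilities(paths)
dominant = max(probs, key=probs.get)
```

Under this reading, a syllable whose tokens concentrate most of the probability mass on one path would count as consistent, while a flat distribution over paths would not.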