This paper focuses on the design of the acoustic module (AM) to improve the performance of a Mandarin TTS system. The AM is composed of the prosody generator, the spectrum generator, and the speech synthesizer. The HMM, recurrent neural network (RNN), and PSOLA algorithms are employed to build the AM. Finally, the performance of our system is analyzed in terms of speech quality, memory requirement, and computational complexity. On a fixed-point DSP chip, the system requires less than 2.4 MB of memory and an average of 0.08 MIPS per syllable. The synthesized speech also sounds very good.
I. INTRODUCTION

A general TTS system includes text analysis, a prosody generator, a synthesis unit generator, and a speech synthesizer. Text analysis is first invoked to parse the input text syntactically and/or semantically and to extract linguistic features. This step usually requires a lexical dictionary established by linguists. The prosody generator receives the linguistic features and generates prosodic information such as the pitch contour, energy envelope, and duration patterns; it determines the naturalness of the synthesized speech. The synthesis unit generator produces the most suitable speech template according to the phonetic symbol; its main goal is to make the synthesized speech clearer. Finally, the speech synthesizer takes the prosodic information and the synthesis unit, applies a prosodic modification algorithm to the synthesis unit, and outputs natural speech. The block diagram of a general TTS system is shown in Fig. 1.

Fig. 1. Block diagram of a general TTS system (input text to synthesized speech).

Cheng-Yu Yeh is with the ).
978-1-4244-6287-2/10/$26.00 ©2010 IEEE

In the past, much effort has been devoted to designing high-quality TTS systems [1]-[6]. Naturalness and fluency remain two important issues for a TTS system, so most researchers have concentrated on the prosody generator. In a general prosody generator, two problems must be overcome to achieve natural and fluent speech. One is choosing a suitable model for the prosody generator, and the other is choosing suitable linguistic features for it. For the first problem, two approaches have been employed to generate suitable prosodic information: the rule-based approach and the statistical approach. The rule-based approach [1], [2] used many pronunciation rules inferred by linguists to improve the speech quality of the TTS system.
The derivation of pronunciation rules, however, is laborious, time-consuming, and tedious. Furthermore, the cross-influence of pronunciation rules on the prosodic information cannot be easily quantified or expressed as independent rules. Moreover, these pronunciation rules must be inferred by acoustic experts and linguists. The statistical approach [3], [4], by contrast, used a probability model or a neural network to automatically organize and infer the pronunciation rules. The natu...