Problem statement: Flexible bit-rate speech coders play an important role in modern speech communication. The MP-CELP speech coder, a candidate for the MPEG-4 natural speech coder, supports a flexible and wide bit-rate range. However, fine scalability has not been included. This study investigates how to support finer scalability of the coding rate. Approach: Based on MP-CELP speech coding with the High Pitch Delay Resolution (HPDR) technique, Fine Granularity Scalability (FGS) was introduced by adjusting the amount of transmitted fixed excitation information. The FGS feature aims to change the bit rate of the conventional coder more finely and more smoothly. Results: Performance analysis and computer simulation showed that the quality of the scalable MP-CELP coding improved on that of conventional scalable MP-CELP. The HPDR technique was also applied to MP-CELP for use with tonal languages; the coder supports core coding rates of 4.2, 5.5 and 7.5 kbps plus additional scaled bit rates. Conclusion: A core coder with the high pitch delay resolution technique and an adaptive codebook for improving tonal speech quality was developed, and FGS provides more efficient scalability.
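As a rough illustration of adjusting the amount of transmitted fixed excitation information, the decoder can be pictured as rebuilding the multipulse excitation from however many pulses it receives. This is a minimal sketch under stated assumptions, not the coder's actual bitstream: the frame length, the pulse values, the ordering of pulses by decreasing magnitude, and the function names `build_excitation` and `truncation_error` are all hypothetical.

```python
def build_excitation(pulses, n_received, frame_len=40):
    """Rebuild the fixed excitation from the first n_received pulses.

    pulses: list of (position, amplitude) pairs, assumed to be ordered so
    that the most important pulses come first in the bitstream.
    frame_len: hypothetical subframe length in samples.
    """
    excitation = [0.0] * frame_len
    for position, amplitude in pulses[:n_received]:
        excitation[position] += amplitude
    return excitation


def truncation_error(pulses, n_received, frame_len=40):
    """Excitation energy lost when only n_received pulses are decoded."""
    full = build_excitation(pulses, len(pulses), frame_len)
    partial = build_excitation(pulses, n_received, frame_len)
    return sum((f - p) ** 2 for f, p in zip(full, partial))


# Hypothetical pulses for one subframe, ordered by decreasing magnitude.
pulses = [(5, 1.0), (17, -0.8), (30, 0.5), (22, -0.3)]

# Error for 0, 1, ..., 4 received pulses: each extra pulse lowers the error,
# which is the fine-grained rate/quality trade-off the abstract describes.
errors = [truncation_error(pulses, k) for k in range(len(pulses) + 1)]
```

Each additional pulse costs a few more bits and removes a little more error, so the bit rate can be stepped in small increments above the core rate rather than jumping between a few fixed modes.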
Problem statement: In HMM-based Thai speech synthesis, tone is an important issue that affects the intelligibility of the synthesized speech. Tone distortion resulting from imbalance in the training data should be treated appropriately. Approach: This study describes an HMM-based speech synthesis system for the Thai language. In the system, spectrum, pitch and state duration are modeled simultaneously in a unified HMM framework, and their parameter distributions are clustered independently using a decision-tree-based context clustering technique. The contextual factors that affect spectrum, pitch and duration, i.e., part of speech, position and number of phones in a syllable, position and number of syllables in a word, position and number of words in a sentence, phone type and tone type, are taken into account when constructing the questions of the decision tree. Since Thai is a tonal language, tone questions play an important role in the context clustering process. Results: Experiments compared the F0 contours of speech synthesized with and without tone questions; in addition, the size of the Thai speech corpus was varied to investigate the synthesized speech quality. Conclusion: Using tone questions in the tree-based context clustering process relieves the tone distortion significantly.
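The role of tone questions in context clustering can be sketched with a toy example. The code below picks, for a single tree node, the yes/no question that best separates a set of F0 observations. It is only an illustration under stated assumptions: the context keys, the sample values, and the function names are hypothetical, and a simple variance-reduction criterion is assumed because the abstract does not state the splitting criterion the system actually uses.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_question(samples, questions):
    """Return (name, gain) of the question with the largest variance reduction.

    samples: list of (context dict, log F0 value) pairs.
    questions: dict mapping a question name to a predicate on the context.
    """
    total = sum_sq_dev([f0 for _, f0 in samples])
    best_name, best_gain = None, 0.0
    for name, predicate in questions.items():
        yes = [f0 for ctx, f0 in samples if predicate(ctx)]
        no = [f0 for ctx, f0 in samples if not predicate(ctx)]
        if not yes or not no:
            continue  # a question that does not split the node is useless
        gain = total - sum_sq_dev(yes) - sum_sq_dev(no)
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Hypothetical training samples: (context, log F0).
samples = [
    ({"tone": "high", "phone_type": "vowel"}, 5.30),
    ({"tone": "high", "phone_type": "consonant"}, 5.28),
    ({"tone": "low", "phone_type": "vowel"}, 4.90),
    ({"tone": "low", "phone_type": "consonant"}, 4.92),
]

# Hypothetical question set: one tone question and one phone-type question.
questions = {
    "tone is high": lambda ctx: ctx["tone"] == "high",
    "phone is vowel": lambda ctx: ctx["phone_type"] == "vowel",
}

chosen, gain = best_question(samples, questions)
```

In this toy data the F0 values differ mainly by tone, so the tone question gives the larger gain. Without a tone question in the set, contexts with different tones would be pooled into the same leaf, which is one way to picture the tone distortion the abstract refers to.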
Problem statement: In spontaneous speech communication, prosody is an important factor that must be taken into account, since prosody affects not only the naturalness but also the intelligibility of speech. A number of systems for synthesizing Thai expressive speech have been developed over the years. However, expressive speech with various speaking styles has not yet been achieved. To generate expressive speech, the fundamental frequency (F0) contours must be modeled accurately to preserve the speech prosody. Approach: This study therefore proposes an analysis of model parameters for Thai speech prosody with three speaking styles and two genders, as preliminary work for speech synthesis. Fujisaki's model, a powerful tool for modeling the F0 contour, was adopted, and the speaking styles of happiness, sadness and reading were considered. Seven parameters are derived from Fujisaki's model. The first is the baseline frequency, which is the lowest level of the F0 contour. The second and third are the numbers of phrase commands and tone commands, which reflect how often surges occur in the utterance at the global and local levels, respectively. The fourth and fifth are the phrase command and tone command durations, which reflect the speaking rate and the length of a syllable, respectively. The sixth and seventh are the amplitudes of the phrase commands and tone commands, which reflect the energy of the global speech and the energy of the local syllable, respectively. Results: In the experiments, each speaking style includes 200 samples of one sentence for each of the male and female speakers, so the speech database contains 1200 utterances in total (3 styles × 2 genders × 200 samples). The results show that most of the proposed parameters can clearly distinguish the three speaking styles. Conclusion: The findings provide strong evidence for applying the successful parameters in speech synthesis systems or other speech processing technologies.
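For readers unfamiliar with the Fujisaki model, the sketch below generates an F0 contour by superposing phrase and tone components on a baseline in the log-frequency domain. It is a minimal sketch under stated assumptions: the command timings and amplitudes are hypothetical, the constants `alpha`, `beta` and `gamma` are set to commonly quoted values rather than values from the study, and only tone commands carry an explicit duration (offset minus onset), since the abstract does not specify how phrase command duration is measured.

```python
import math


def phrase_response(t, alpha=3.0):
    """Phrase control response: alpha^2 * t * exp(-alpha * t) for t >= 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def tone_response(t, beta=20.0, gamma=0.9):
    """Tone control response: min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, baseline_hz, phrase_cmds, tone_cmds):
    """Natural log of F0 at time t (seconds).

    phrase_cmds: list of (onset, amplitude).
    tone_cmds: list of (onset, offset, amplitude); amplitude may be negative.
    """
    value = math.log(baseline_hz)
    for onset, amplitude in phrase_cmds:
        value += amplitude * phrase_response(t - onset)
    for onset, offset, amplitude in tone_cmds:
        value += amplitude * (tone_response(t - onset) - tone_response(t - offset))
    return value


# Hypothetical utterance: one phrase command and one tone command.
baseline_hz = 120.0
phrase_cmds = [(0.0, 0.5)]
tone_cmds = [(0.2, 0.5, 0.4)]

# F0 in Hz sampled every 10 ms over one second.
contour = [
    math.exp(log_f0(0.01 * i, baseline_hz, phrase_cmds, tone_cmds))
    for i in range(100)
]
```

In these terms, the seven parameters in the abstract correspond to `baseline_hz`, the lengths of `phrase_cmds` and `tone_cmds`, the command durations, and the command amplitudes.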