The modeling of fundamental frequency (F0) in speech synthesis is a critical factor affecting the intelligibility and naturalness of synthesized speech. In this paper, we focus on improving the modeling of F0 for Isarn speech synthesis. We propose the F0 model for this based on a recurrent neural network (RNN). Sampled values of F0 are used at the syllable level of continuous Isarn speech combined with their dynamic features to represent supra-segmental properties of the F0 contour. Different architectures of the deep RNNs and different combinations of linguistic features are analyzed to obtain conditions for the best performance. To assess the proposed method, we compared it with several RNN-based baselines. The results of objective and subjective tests indicate that the proposed model significantly outperformed the baseline RNN model that predicts values of F0 at the frame level, and the baseline RNN model that represents the F0 contours of syllables by using discrete cosine transform.
This paper describes the Isarn speech synthesis system, which is a regional dialect spoken in the Northeast of Thailand. In this study, we focus to improve the prosody generation of the system by using the additional context features. In order to develop the system, the speech parameters (Mel-ceptrum and fundamental frequencies of phoneme within different phonetic contexts) were modelled using Hidden Markov Models (HMM). Synthetic speech was generated by converting the input text into context-dependent phonemes. Speech parameters were generated from the trained HMM, according to the context-dependent phonemes, and were then synthesized through a speech vocoder. In this study, systems were trained using three different feature sets: basic contextual features, tonal, and syllable-context features. Objective and subjective tests were conducted to determine the performance of the proposed system. The results indicated that the addition of the syllable-context features significantly improved the naturalness of synthesized speech.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.