The generation of prosodic parameters such as F0 contour, duration and intensity remains an important issue for natural-sounding text-to-speech (TTS), although recently developed TTS systems have achieved considerable progress. Several appropriate but language-specific rule-based, statistical or data-driven prosody models have been successfully realized in many systems. These language- and parameter-dependent models lead to a more complex and inefficient TTS system design. In earlier works the authors proposed a hybrid data-driven and rule-based model, which can adapt to different voices or speaking styles by learning and predicting prosodic parameters. The current paper discusses the multilingual generalization of the model and the design of appropriate prosodic databases. As examples, two different languages, German and Mandarin Chinese, are examined. Prediction results and perceptual evaluations with respect to F0 contours and duration values are presented. Since the perceptual results for both languages are comparable and quite satisfying, the model is suitable for multilingual prosody control. Resynthesis stimuli obtained from modified prosodic parameters partly achieve near-to-natural mean opinion scores (MOS) above 4.0. The introduced hybrid data-driven and rule-based model is comparatively simple and enables multilingual prosody control in TTS.