F0 Parameterization of Glottalized Tones in HMM-Based Speech Synthesis for Hanoi Vietnamese

Ninh, Duy Khanh; Yamashita, Yoichi

doi:10.1587/transinf.2015edp7134

Cited by 2 publications

(2 citation statements)

References 16 publications


“…For glottalized tones such as Broken and Drop tones ("Thanh ngã" and "Thanh nặng" in Vietnamese) and for some creaky voices, particularly those of Northern Vietnamese speakers, it is difficult to extract complete and accurate F0 contours from speech signal due to large variations of the signal's degree of periodicity. Thus the F0 extraction method proposed in [6] was employed in our system to alleviate this problem. Besides, we used the high-quality speech vocoding method STRAIGHT to extract spectral and aperiodicity measurements from speech signals as described in the Nitech-HTS 2005 system [14].…”

Section: Extracting Speech Parameters (mentioning)

confidence: 99%

“…A couple of HMM-based Text-to-Speech (TTS) systems for Vietnamese have been developed since 2009 [2], [3]. Latest refinements being made to these systems involved in the integration of syntactic information and intonational tags to improve the overall naturalness of generated prosody [4], [5] or the accurate extraction of pitch contours for glottalized tones to enhance the tonal analysis and synthesis [6]. Although the obtained results are promising, all of the above systems are built using the speaker-dependent approach with a moderate amount of training data of one speaker.…”

Section: Introduction (mentioning)

confidence: 99%


Evaluation of speaker-dependent and average-voice Vietnamese statistical speech synthesis systems

Khánh¹

2019

jst


This paper describes the development and evaluation of a Vietnamese statistical speech synthesis system using the average voice approach. Although speaker-dependent systems have been applied extensively, no average voice based system has been developed for Vietnamese so far. We have collected speech data from several Vietnamese native speakers and employed state-of-the-art speech analysis, model training and speaker adaptation techniques to develop the system. Besides, we have performed perceptual experiments to compare the quality of speaker-adapted (SA) voices built on the average voice model and speaker-dependent (SD) voices built on SD models, and to confirm the effects of contextual features including word boundary (WB) and part-of-speech (POS) on the quality of synthetic speech. Evaluation results show that SA voices have significantly higher naturalness than SD voices when the same limited contextual feature set excluding WB and POS is used. In addition, SA voices trained with limited contextual features excluding WB and POS still have better quality than SD voices trained with full contextual features including WB and POS. These results show the robustness of the average voice method over the speaker-dependent approach for Vietnamese statistical speech synthesis.

