In this paper, new efforts to build text-to-speech (TTS) synthesis systems for Indian languages are presented. The synthesisers are built around both concatenative speech synthesis and statistical parametric speech synthesis frameworks. Text-to-speech synthesis systems require accurate segmentation, and obtaining accurate segmentation at the phone level is a difficult task. Manual segmentation leads to human errors, while automatic segmentation using statistical approaches (hidden Markov model based approaches) leads to poor boundary information when the amount of data used for training is small. A semi-automatic tool for group delay based syllable segmentation is discussed. The tool is semi-automatic because some of the boundaries obtained are inaccurate and have to be corrected manually. Next, a segmentation algorithm that uses both HMM based segmentation and group delay based segmentation is used to obtain accurate boundaries automatically. The boundaries obtained are used in the syllable-based synthesiser for unit selection. In the statistical phone-based synthesiser, embedded re-estimation is performed at the phone level. Syllable-based and penta-phone based HMMs are used for building the synthesiser. TTS systems for 12 different Indian languages, namely Tamil, Hindi, Marathi, Malayalam, Telugu, Rajasthani, Bengali, Odia, Assamese, Manipuri, Kannada and Gujarati, are built using semi-automatic segmentation, and synthesisers have been built for 7 Indian languages using automatic segmentation. Evaluation of the semi-automatic segmentation systems indicates that the MOS (mean opinion score) is above 3.0 for most of the languages. Pair comparison tests on semi-automatic vs. automatic segmentation show that automatic segmentation is preferred.
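The two subjective measures mentioned above can be illustrated with a minimal sketch. It assumes the usual 1 to 5 rating scale for MOS and a simple vote count for the pair comparison test; the function names and the example ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


def preference_share(votes, system):
    """Fraction of pair comparison votes that favour the given system."""
    if not votes:
        raise ValueError("at least one vote is required")
    return votes.count(system) / len(votes)


# Illustrative (made-up) listener responses, not results from the paper.
example_ratings = [3, 4, 3, 4]
example_votes = ["automatic", "semi-automatic", "automatic", "automatic"]

mos = mean_opinion_score(example_ratings)
share = preference_share(example_votes, "automatic")
```

With these illustrative inputs, `mos` is the plain average of the four ratings and `share` is the fraction of listeners who preferred the automatically segmented system.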
Automatic segmentation of speech using embedded re-estimation of monophone hidden Markov models (HMMs) followed by forced alignment may not give accurate boundaries. Group delay (GD) processing for refining the boundaries at the syllable level has been attempted earlier. This paper explores the use of the vowel onset point (VOP) and the vowel offset or end point (VEP) for correcting the boundaries obtained using HMM alignment. HMMs model the class information well but may not detect the exact boundary. VOP and VEP detection, in contrast, can produce spurious or missed detections, but the boundaries that are detected are more accurate. Combining HMM alignment with VOP/VEP detection improves the log-likelihood scores of the forced-aligned phoneme boundaries. HMM boundaries are corrected using VOPs and VEPs, and model parameters are re-estimated at the syllable level. The results are compared with those of GD based correction, and the overall performance is found to be comparable. Performance for vowels is found to be higher than that of GD based refinement, as the refinement in this case is mainly at the vowel boundaries. HMM based speech synthesis systems (HTS) are developed using the phone as the basic unit with the proposed segmentation method. Subjective evaluation indicates that there is an improvement in the quality of synthesis.
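The idea of correcting HMM boundaries with detected vowel events can be sketched as follows. The abstract does not state the exact correction rule, so this is an assumed one: each HMM boundary is moved to the nearest detected VOP or VEP if one lies within a tolerance window, and is otherwise left unchanged (which covers missed detections). The function name `refine_boundaries` and the 20 ms default tolerance are hypothetical.

```python
from bisect import bisect_left


def refine_boundaries(hmm_boundaries, event_times, tolerance=0.02):
    """Snap HMM boundaries (seconds) to nearby VOP/VEP events.

    Assumed rule, not the paper's: a boundary moves to the nearest
    event only if that event is within `tolerance` seconds.
    """
    events = sorted(event_times)
    refined = []
    for boundary in hmm_boundaries:
        # Only the events immediately left and right can be nearest.
        i = bisect_left(events, boundary)
        candidates = events[max(i - 1, 0): i + 1]
        nearest = min(candidates, key=lambda e: abs(e - boundary), default=None)
        if nearest is not None and abs(nearest - boundary) <= tolerance:
            refined.append(nearest)   # trust the more precise event
        else:
            refined.append(boundary)  # no event nearby: keep HMM estimate
    return refined


# Illustrative times in seconds (made up for the example).
corrected = refine_boundaries([0.10, 0.50, 0.90], [0.11, 0.70])
```

Here only the first boundary has an event within the window, so it is the only one that moves. A fuller version would also guard against two boundaries snapping to the same event.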
HMM based speech synthesis (HTS) is a state-of-the-art approach to text-to-speech synthesis. Segmentation of the training data is essential for building any text-to-speech system. Most conventional text-to-speech systems use phones as the basic unit of synthesis and use a speech recogniser to automatically segment the data at the phone level. As Indian languages are low-resource languages, accurate transcriptions are difficult to obtain owing to paucity of data. Manual labeling at the phone level is not only laborious but also inaccurate. HMM based flat-start segmentation does not work well at the sentence level. In this paper, we propose an event-driven approach to obtain better phone boundaries. Syllable-like events are detected in the speech signal and matched with the syllabified transcription of the text. The syllables are converted to phoneme sequences, and Baum-Welch embedded re-estimation is restricted to the syllable level. Subjective evaluations indicate that the proposed system has a lower word error rate than a conventional system that uses flat start for obtaining phone boundaries.
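The effect of restricting the alignment to the syllable level can be illustrated with a small sketch. It does not implement Baum-Welch re-estimation; it only shows the constraint, assuming that syllable boundaries are already known and that phone boundaries are initialised by dividing each syllable uniformly among its phones, so that no phone can cross a syllable boundary. The function name and the input layout are hypothetical.

```python
def init_phone_segments(syllables):
    """Uniform phone initialisation inside fixed syllable segments.

    syllables: list of (start_sec, end_sec, [phone, ...]) tuples.
    Returns a list of (phone, start_sec, end_sec) tuples.
    """
    segments = []
    for start, end, phones in syllables:
        if end <= start or not phones:
            raise ValueError("each syllable needs a positive duration and phones")
        step = (end - start) / len(phones)
        for k, phone in enumerate(phones):
            seg_start = start + k * step
            # Pin the last phone to the syllable end to avoid rounding drift.
            is_last = k == len(phones) - 1
            seg_end = end if is_last else start + (k + 1) * step
            segments.append((phone, seg_start, seg_end))
    return segments


# Illustrative syllable segments (times and phones made up for the example).
phone_segments = init_phone_segments([
    (0.0, 0.5, ["k", "a"]),
    (0.5, 1.25, ["m", "a", "l"]),
])
```

In contrast, a sentence-level flat start would spread all five phones uniformly over the whole utterance, which ignores the detected syllable boundaries entirely.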