Peter Rutten scite author profile

Data sparsity is a major problem for data-driven prosodic models. Being able to share prosodic data across speakers is a potential solution to this problem. This paper explores this potential solution by addressing two questions: 1) Does a larger, less sparse model from a different speaker produce more natural speech than a small, sparse model built from the original speaker? 2) Does a different speaker's larger model generate more unit selection errors than a small, sparse model built from the original speaker?

A unit selection approach is used to produce a lazy learning model of three English RP speakers' f0 and durational parameters. Speaker 1 (the target speaker) had a much smaller database, approximately one quarter to one fifth the size of those of the other two speakers. Speaker 2 was a female speaker with frequent mid-phrase rises. Speaker 3 was a male speaker with a similar f0 range to speaker 1 and with a measured prosodic style suitable for news and financial text.

We apply the models created for speaker 2 (an inappropriate model) and speaker 3 (an appropriate model) to speaker 1 and compare the results. Three passages (of three to four sentences in length) from challenging prosodic genres (news report, poetry and personal email) were synthesised using the target speaker and each of the three models. The synthesised utterances were played to 15 native English subjects and rated using a 5-point MOS scale. In addition, 7 experienced speech engineers rated each word for errors on a three-point scale: 1. Acceptable, 2. Poor, 3. Unacceptable.

The results suggest that a large model from an appropriate speaker does not sound more natural or produce fewer errors than a smaller model generated from the individual speaker's own data. In addition, they show that an inappropriate model does produce speech that is both less natural and contains more errors. High variance in both the subject and the materials analyses suggests that both tests are far from ideal and that evaluation techniques for both error rate and naturalness need to improve.



Peter Rutten

Segment selection in the L&H Realspeak laboratory TTS system

The application of interactive speech unit selection in TTS systems

Issues in corpus based speech synthesis

A statistically motivated database pruning technique for unit selection synthesis

My voice, your prosody: sharing a speaker specific prosody model across speakers in unit selection TTS
