Efficient and reliable perceptual weight tuning for unit-selection text-to-speech synthesis based on active interactive genetic algorithms: A proof-of-concept

Alías, Francesc; Formiga, Lluís; Llorà, Xavier

doi:10.1016/j.specom.2011.01.004

Cited by 12 publications

(2 citation statements)

References 28 publications

(65 reference statements)


“…This block retrieves the units that minimise the prosodic, linguistic, and concatenation costs (see [42] for more details). The weights for the prosodic target and concatenation subcosts were perceptually tuned by means of active interactive genetic algorithms for speech synthesis purposes [44].…”

Section: Text-to-speech Subsystem (mentioning)

confidence: 99%

A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept

Freixes

Alías

Socoró

2019

J AUDIO SPEECH MUSIC PROC.

Self Cite


Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for eventual singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, timescale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of vocaloid, the singing scores of around 60 which were obtained validate that the framework could reasonably address eventual singing needs.


“…Therefore, this method involves less signal processing or no signal processing. Unit selection method is popular attributable to its high intelligibility and naturalness of output speech (Alias et al, 2011). However, demands larger database for better quality (Barra-Chicote et al, 2010).…”

Section: Comparison Of State-of-the-art Speech Synthesis System (mentioning)

confidence: 99%

Low Footprint High Intelligibility Malay Speech Synthesizer Based on Statistical Data

Yong

Swee

2014

Journal of Computer Science


Speech synthesis plays a pivotal role nowadays. It can be found in various daily applications such as in mobile phones, navigation systems, languages learning software and so on. In this study, a Malay language speech synthesizer was designed using hidden Markov model to improve the performance of current Malay speech synthesizer and also extend Malay speech technology. Statistical parametric method was utilized in this study. The database was constructed to be balanced with all the phonetic sample appeared in Malay language. The results were rated by 48 listeners and obtained a moderate high rating ranging from 3.79 to 4.23 out of 5. The computed Word Error Rate is 7.1%. The total file size is less than 2 Megabytes which means it is suitable to be embedded into daily application. In conclusion, a Malay language speech synthesizer was designed using statistical parametric method with hidden Markov model. The output speech was verified to be good in quality. The file size is small indicates the feasibility to be used in embedded system.


Unit Selection Cost Function Exploration Using an A* Based Text-to-Speech System

Guennec

Lolive

2014

Text, Speech and Dialogue
