Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

Taylor, Jason R.; Le Maguer, Sébastien; Richmond, Korin

doi:10.21437/ssw.2021-34

Cited by 2 publications

(1 citation statement)

References 23 publications

“…Note that letter-to-sound (L2S) front-ends have poor performance on lexicons: [14] report 4.6% Phoneme Error Rate (PER) vs. 19.88% Word Error Rate (WER) on the CMUDict dataset using Token-Level Ensemble Distillation, while [15] report similar performance on CMUDict, Pronlex and NetTalk using encoder-decoder models. L2S front-ends for phonetic-to-speech synthesis are thus likely to produce similar PER as implicit L2S conversion reported by end-to-end TTS for English [1] or French [16].…”

Section: TTS and Phonological Variations (mentioning)

confidence: 69%

Advocating for text input in multi-speaker text-to-speech systems

Bailly, Lenglet, Perrotin et al. 2023

12th ISCA Speech Synthesis Workshop (SSW2023)

Nowadays, text-to-speech synthesis (TTS) systems are most commonly trained using phonetic input. This is mostly due to the poor performance of the letter-to-sound (L2S) mapping (in particular for languages with opaque orthography) performed by end-to-end TTS: the empirical distribution of the words sampled in the sole training corpus cannot compete with pronunciation dictionaries. Taylor and Richmond [1] actually reported letter-to-sound errors (implicitly performed by end-to-end systems from raw text input) close to 10%. This paper nevertheless shows that speakers produce lawful phonological variations and that end-to-end TTS systems trained to accept text input, once trained adequately, can capture these variations of pronunciation that are strong markers of sociolinguistic features. We illustrate such variations on liaisons and schwas in French and r-linking in British English. We therefore advocate for restoring text input for TTS, so that the many aspects of style variations (produced by speakers as well as stylistic variations) encoded by suprasegmental features can also be reflected in actual variations of pronunciation.
