11th ISCA Speech Synthesis Workshop (SSW 11) 2021
DOI: 10.21437/ssw.2021-34

Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

Abstract: Sequence-to-sequence (S2S) TTS models like Tacotron have grapheme-only inputs when trained fully end-to-end. Grapheme inputs map to phone sounds depending on context, which traditionally is handled by extensive preprocessing in the TTS front-end. However, French orthography does not provide a clear one-to-one mapping between graphemes and sounds, and in English, which similarly has rather non-phonetic orthography, pronunciations are a significant cause of error in S2S-TTS with grapheme-inputs. In this paper, w…

Citations: cited by 2 publications (1 citation statement)
References: 23 publications
“…Note that letter-to-sound (L2S) front-ends have poor performance on lexicons: [14] report 4.6% Phoneme Error Rate (PER) vs. 19.88% Word Error Rate (WER) on the CMUDict dataset using Token-Level Ensemble Distillation, while [15] report similar performance on CMUDict, Pronlex and NetTalk using encoder-decoder models. L2S front-ends for phonetic-to-speech synthesis are thus likely to produce similar PER as implicit L2S conversion reported by end-to-end TTS for English [1] or French [16].…”
Section: TTS and Phonological Variations (citation type: mentioning)
confidence: 69%
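
The gap between the two figures quoted above (a PER of a few percent next to a WER several times larger) follows from how the metrics are usually defined for letter-to-sound evaluation. The sketch below assumes the common convention: PER is the total phoneme-level edit distance divided by the number of reference phonemes, and WER is the fraction of words whose predicted pronunciation differs from the reference at all. The function names and the four-word toy lexicon are hypothetical illustrations, not data or code from the cited papers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(pairs):
    """Total phoneme edits divided by total reference phonemes."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


def word_error_rate(pairs):
    """Fraction of words with at least one phoneme error."""
    wrong = sum(1 for ref, hyp in pairs if ref != hyp)
    return wrong / len(pairs)


# Hypothetical toy lexicon: (reference phonemes, predicted phonemes).
pairs = [
    (["K", "AE", "T"], ["K", "AE", "T"]),
    (["D", "AO", "G"], ["D", "AA", "G"]),   # one substituted vowel
    (["F", "IH", "SH"], ["F", "IH", "SH"]),
    (["B", "ER", "D"], ["B", "ER", "D"]),
]

per = phoneme_error_rate(pairs)
wer = word_error_rate(pairs)
print(f"PER = {per:.3f}, WER = {wer:.3f}")
```

In this toy case a single wrong phoneme out of twelve makes one word in four incorrect, so the word-level rate is three times the phoneme-level rate. Under these definitions a word counts as wrong as soon as any one of its phonemes is wrong, which is why WER grows faster than PER as words get longer.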