Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS

Tahon, Marie; Qader, Raheel; Lecorvé, Gwénolé; Lolive, Damien

doi:10.1007/978-3-319-45925-7_9

Cited by 2 publications

(4 citation statements)

References 14 publications

“…In their previous work [4], [43], the authors have presented the training process of a voice-specific P2P model with the corpus TelecomVo training subcorpus. A first set of 15 features including linguistic, phonological and prosodic features with a W 2 window, was automatically selected.…”

Section: Voice-specific P2P Model (mentioning)

confidence: 99%

“…Except alphabet mapping, four types of phoneme confusions have been reported. A lot of pronunciation variants, related to the pronunciation of the speaker itself, are observed for midvowels /ø/, /@/, /e/, /E/, /O/, /o/ (for example, /e/ ↔ /E/ and /o/ ↔ /O/) [43], [38]. The elision of final liquids /K/ and /l/ is also observed in the target pronunciation.…”

Section: Phoneme Confusions Between Styles (mentioning)

confidence: 99%

Can We Generate Emotional Pronunciations for Expressive Speech Synthesis?

Tahon; Lecorvé; Lolive. 2020. IEEE Trans. Affective Comput.

Self Cite

Abstract: In the field of expressive speech synthesis, a lot of work has been conducted on suprasegmental prosodic features, while little has been done on pronunciation variants. However, prosody is highly related to the sequence of phonemes to be expressed. This article raises two issues in the generation of emotional pronunciations for TTS systems. The first issue consists in designing an automatic pronunciation generation method from text, while the second addresses the very existence of emotional pronunciations through experiments conducted on emotional speech. To do so, an innovative pronunciation adaptation method is presented, which automatically adapts canonical phonemes first to those labeled in the corpus used to create a synthetic voice, then to those labeled in an expressive corpus. This method consists in training conditional random field pronunciation models with prosodic, linguistic, phonological and articulatory features. The analysis of emotional pronunciations reveals strong dependencies between prosody and phoneme assimilation or elision. According to perceptual tests, the double adaptation makes it possible to synthesize expressive speech samples of good quality, but emotion-specific pronunciations are too subtle to be perceived by testers.

“…The voice pronunciation model adapts canonical phonemes to phonemes as realized in the speech corpus. In previous work [19,20], we have presented the training process of a P2P voice-specific model with the corpus Telecom. Table 2 shows the distribution of selected features within groups.…”

Section: P2P Voice-specific Pronunciation Model (mentioning)

confidence: 99%

“…It was also used to predict a corpus-specific pronunciation, i.e. a pronunciation adapted to the TTS voice corpus, thus conducting to a significant improvement of the overall quality of synthesized speech [19,20]. In the work realized in [19], we manage to synthesize good quality speech samples on a neutral voice.…”

Section: Introduction (mentioning)

confidence: 99%

Perception of Expressivity in TTS: Linguistics, Phonetics or Prosody?

Tahon; Lecorvé; Lolive; et al. 2017. Statistical Language and Speech Processing

Self Cite

Currently, much of the work on expressive speech focuses on acoustic models and prosody variations. However, in expressive Text-to-Speech (TTS) systems, prosody generation strongly relies on the sequence of phonemes to be expressed and also on the words underlying these phonemes. Consequently, linguistic and phonetic cues play a significant role in the perception of expressivity. In previous works, we proposed a statistical corpus-specific framework which adapts phonemes derived from an automatic phonetizer to the phonemes as labelled in the TTS speech corpus. This framework makes it possible to synthesize good quality but neutral speech samples. The present study goes further in the generation of expressive speech by predicting not only corpus-specific but also expressive pronunciation. It also investigates the shared impacts of linguistics, phonetics and prosody, these impacts being evaluated through different French neutral and expressive speech data collected with different speaking styles and linguistic content and expressed under diverse emotional states. Perception tests show that expressivity is more easily perceived when linguistics, phonetics and prosody are consistent. Linguistics seems to be the strongest cue in the perception of expressivity, but phonetics greatly improves expressiveness when combined with an adequate prosody.
