Interspeech 2016
DOI: 10.21437/interspeech.2016-864

Improving TTS with Corpus-Specific Pronunciation Adaptation

Abstract: Text-to-speech (TTS) systems are built on speech corpora labeled with carefully checked and segmented phonemes. However, phoneme sequences generated by automatic grapheme-to-phoneme converters during synthesis are usually inconsistent with those of the corpus, leading to poor-quality synthetic speech signals. To solve this problem, the present work aims at adapting automatically generated pronunciations to the corpus. The main idea is to train corpus-specific phoneme-to-phoneme conditional random fields…
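
The truncated sentence points to corpus-specific phoneme-to-phoneme (P2P) adaptation with conditional random fields. As a rough illustration only, not the authors' implementation, the sketch below trains a linear-chain CRF that rewrites grapheme-to-phoneme (G2P) output toward the phonemes observed in a voice corpus. The sklearn-crfsuite package, the toy aligned phoneme pairs, the "_" deletion symbol, and the phone_features helper are all assumptions introduced here.

```python
# Minimal P2P adaptation sketch (assumed setup, not the paper's code):
# a linear-chain CRF maps canonical G2P phonemes to the phonemes actually
# realised in the voice corpus, position by position.
import sklearn_crfsuite

# Hypothetical hand-aligned pairs: (canonical G2P phonemes, corpus phonemes).
# "_" marks a deletion in the corpus realisation (e.g. a dropped schwa).
aligned_pairs = [
    (["p", "@", "t", "i", "t"], ["p", "_", "t", "i", "t"]),
    (["m", "E", "z", "o~", "t"], ["m", "e", "z", "o~", "t"]),
]

def phone_features(seq, i):
    """Contextual features for position i of the canonical phoneme sequence."""
    return {
        "cur": seq[i],
        "prev": seq[i - 1] if i > 0 else "<s>",
        "next": seq[i + 1] if i + 1 < len(seq) else "</s>",
    }

# One feature dict per phoneme, one corpus phoneme per position as the label.
X = [[phone_features(canon, i) for i in range(len(canon))] for canon, _ in aligned_pairs]
y = [list(corpus) for _, corpus in aligned_pairs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

# At synthesis time, G2P output is rewritten toward corpus usage; "_" tokens
# would be dropped before the sequence is fed to the synthesizer.
canonical = ["p", "@", "t", "i", "t"]
adapted = crf.predict_single([phone_features(canonical, i) for i in range(len(canonical))])
print(adapted)  # e.g. ['p', '_', 't', 'i', 't']
```

In this kind of setup the CRF only needs the canonical-to-corpus alignment once, at training time; at runtime it acts as a post-processor on the G2P output, which is why the adaptation stays corpus-specific without touching the G2P converter itself.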

Cited by 4 publications (12 citation statements). References 18 publications.
“…A perceptual study [40] has shown that samples synthesized with the target pronunciation were preferred to those synthesized with the canonical pronunciation. Also, the adaptation of the canonical pronunciation to the voice corpus has shown a clear preference in terms of quality [4]. However, it seems that the generation of spontaneous speech requires some compromises between intelligibility and quality [41].…”
Section: Studies On Pronunciation Variants Modelling
confidence: 99%
“…The emotional P2P system should fit pretty well with emotional pronunciation, thus increasing the expressivity of output speech samples, but will probably overfit the data. Moreover, if this set-up is not adapted to the voice corpus, then inconsistencies between the corpus used for synthesis and the corpus used for pronunciation remain, lowering the TTS quality [4]. Fig.…”
Section: Exp Single Adaptation Protocol
confidence: 99%