Interspeech 2016
DOI: 10.21437/interspeech.2016-864

Improving TTS with Corpus-Specific Pronunciation Adaptation

Abstract: Text-to-speech (TTS) systems are built on speech corpora labeled with carefully checked and segmented phonemes. However, phoneme sequences generated by automatic grapheme-to-phoneme converters during synthesis are usually inconsistent with those of the corpus, leading to poor-quality synthetic speech signals. To solve this problem, the present work aims at adapting automatically generated pronunciations to the corpus. The main idea is to train corpus-specific phoneme-to-phoneme conditional random fields…
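
The truncated sentence points to corpus-specific phoneme-to-phoneme (P2P) adaptation with conditional random fields. As a rough illustration only, not the authors' implementation, the sketch below trains a linear-chain CRF that rewrites grapheme-to-phoneme (G2P) output toward the phonemes observed in a voice corpus. The sklearn-crfsuite package, the toy aligned phoneme pairs, the "_" deletion symbol, and the phone_features helper are all assumptions introduced here.

```python
# Minimal P2P adaptation sketch (assumed setup, not the paper's code):
# a linear-chain CRF maps canonical G2P phonemes to the phonemes actually
# realised in the voice corpus, position by position.
import sklearn_crfsuite

# Hypothetical hand-aligned pairs: (canonical G2P phonemes, corpus phonemes).
# "_" marks a deletion in the corpus realisation (e.g. a dropped schwa).
aligned_pairs = [
    (["p", "@", "t", "i", "t"], ["p", "_", "t", "i", "t"]),
    (["m", "E", "z", "o~", "t"], ["m", "e", "z", "o~", "t"]),
]

def phone_features(seq, i):
    """Contextual features for position i of the canonical phoneme sequence."""
    return {
        "cur": seq[i],
        "prev": seq[i - 1] if i > 0 else "<s>",
        "next": seq[i + 1] if i + 1 < len(seq) else "</s>",
    }

# One feature dict per phoneme, one corpus phoneme per position as the label.
X = [[phone_features(canon, i) for i in range(len(canon))] for canon, _ in aligned_pairs]
y = [list(corpus) for _, corpus in aligned_pairs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

# At synthesis time, G2P output is rewritten toward corpus usage; "_" tokens
# would be dropped before the sequence is fed to the synthesizer.
canonical = ["p", "@", "t", "i", "t"]
adapted = crf.predict_single([phone_features(canonical, i) for i in range(len(canonical))])
print(adapted)  # e.g. ['p', '_', 't', 'i', 't']
```

In this kind of setup the CRF only needs the canonical-to-corpus alignment once, at training time; at runtime it acts as a post-processor on the G2P output, which is why the adaptation stays corpus-specific without touching the G2P converter itself.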

Cited by 4 publications (12 citation statements). References 18 publications.
“…A perceptual study [40] has shown that samples synthesized with the target pronunciation were preferred to those synthesized with the canonical pronunciation. Also, the adaptation of the canonical pronunciation to the voice corpus has shown a clear preference in terms of quality [4]. However, it seems that the generation of spontaneous speech requires some compromises between intelligibility and quality [41].…”
Section: Studies On Pronunciation Variants Modelling
confidence: 99%
“…The emotional P2P system should fit pretty well with emotional pronunciation, thus increasing the expressivity of output speech samples, but will probably overfit the data. Moreover, if this set-up is not adapted to the voice corpus, then inconsistencies between the corpus used for synthesis and the corpus used for pronunciation remain, lowering the TTS quality [4]. Fig.…”
Section: Exp Single Adaptation Protocol
confidence: 99%