Interspeech 2020
DOI: 10.21437/interspeech.2020-1788

Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding

Abstract: State-of-the-art end-to-end speech synthesis models have reached levels of quality close to human capabilities. However, there is still room for improvement in terms of naturalness, related to prosody, which is essential for human-machine interaction. Therefore, part of current research has shifted its focus to improving this aspect with many solutions, which mainly involve prosody adaptability or control. In this work, we explored a way to include linguistic features into the sequence-to-sequence Tacotron2 syste…

Citations: cited by 4 publications (1 citation statement)
References: 15 publications
“…Phonemic transcription is also essential in speech recognition systems, where the models generally learn representations of the speech signal at phone-level (Zeineldeen et al 2020). For TTS systems, the complete lexical annotation of the orthographic transcript is essential, and many recent studies augment the text input with this annotation and, as a result, enhance the naturalness and adequacy of the output speech (Peiró-Lilja and Farrús 2020; Taylor and Richmond 2020).…”
Section: Introduction (mentioning)
confidence: 99%