2022
DOI: 10.36244/icj.2022.3.7

Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2

Abstract: Speech synthesis has the aim of generating humanlike speech from text. Nowadays, with end-to-end systems, highly natural synthesized speech can be achieved if a large enough dataset is available from the target speaker. However, often it would be necessary to adapt to a target speaker for whom only a few training samples are available. Limited data speaker adaptation might be a difficult problem due to the overly few training samples. Issues might appear with a limited speaker dataset, such as the irregular al…

Cited by 4 publications (2 citation statements)
References 15 publications (24 reference statements)

“…Csapó et al have extensively explored the role of prosodic variability methods in a corpus-based unit selection text-to-speech system [31], and have worked on enhancing the naturalness of synthesized speech [32]. More recently, Mandeel et al [33] demonstrate successful speaker adaptation experiments using Tacotron2, a state-of-the-art text-to-speech synthesis system.…”
Section: Speaker Adaptation In Text-to-speech Synthesis
Citation type: mentioning
confidence: 99%
“…This study investigated and adapted many postfilter architectures with minimal data. Using the TTS model (Tacotron2), it was found that five minutes of the target speaker's adaptation data with a low training time of checkpoint 900 (an iteration point in the training process) is enough to have a reasonable synthesized speech quality [55]. Moreover, a meta-learning algorithm was applied to the speaker adaptation method to increase the target speaker similarity and decrease the adaptation data [56].…”
Section: Speaker Adaptation
Citation type: mentioning
confidence: 99%