ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413466
Low-Resource Expressive Text-To-Speech Using Data Augmentation

Abstract: While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style f…

Cited by 31 publications (22 citation statements)
References 18 publications
“…This artificially boosts the training data available for the resource-scarce target speaker by leveraging readily available source speaker data. However, we have observed that this solution does not scale to achieve naturalness on par with a full-data model for more expressive voices than those presented in [1].…”
Section: Introduction
confidence: 86%
“…As in Huybrechts et al [1], the method presented in this paper is based on three main steps: 1) data augmentation, 2) multispeaker TTS and 3) fine-tuning. In this work, we also investigate the addition of a fourth step where we fine-tune the model with a cGAN approach to further improve the audio quality.…”
Section: Proposed Methods
confidence: 99%