ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413466
Low-Resource Expressive Text-To-Speech Using Data Augmentation

Abstract: While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style f…

Cited by 31 publications (22 citation statements)
References 18 publications
“…This artificially boosts the training data available for the resource-scarce target speaker by leveraging readily available source speaker data. However, we have observed that this solution does not scale to achieve naturalness on par with a full-data model for more expressive voices than those presented in [1].…”
Section: Introduction
confidence: 86%
“…As in Huybrechts et al [1], the method presented in this paper is based on three main steps: 1) data augmentation, 2) multispeaker TTS and 3) fine-tuning. In this work, we also investigate the addition of a fourth step where we fine-tune the model with a cGAN approach to further improve the audio quality.…”
Section: Proposed Methods
confidence: 99%