ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746291
Distribution Augmentation for Low-Resource Expressive Text-To-Speech

Cited by 4 publications (1 citation statement)
References 13 publications
“…In parallel, we train a duration model which, at inference time, predicts the duration of each phoneme given the phoneme sequence. The duration model, as in [21,27], consists of a stack of 3 convolution layers with 512 channels, a kernel size of 5, and 30% dropout, followed by a Bi-LSTM layer and a linear dense layer. To produce speech, we vocode the mel-spectrogram frames using a universal vocoder [28].…”
Section: Non-Attentive TTS Architecture
Confidence: 99%
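The quoted architecture (3 convolution layers with 512 channels, kernel size 5, 30% dropout, a Bi-LSTM, and a linear output layer) can be sketched in PyTorch as below. This is a minimal illustration, not the authors' implementation: the phoneme vocabulary size, embedding dimension, LSTM hidden size, and the use of ReLU/BatchNorm between convolutions are assumptions not stated in the cited text.

```python
import torch
import torch.nn as nn


class DurationModel(nn.Module):
    """Sketch of a phoneme duration predictor: 3 conv layers -> Bi-LSTM -> linear."""

    def __init__(self, n_phonemes=100, emb_dim=512, channels=512,
                 kernel_size=5, dropout=0.3, lstm_hidden=256):
        super().__init__()
        # Embedding size and vocabulary size are illustrative assumptions.
        self.embedding = nn.Embedding(n_phonemes, emb_dim)
        layers = []
        in_ch = emb_dim
        for _ in range(3):  # 3 conv layers, 512 channels, kernel size 5, 30% dropout
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
                nn.BatchNorm1d(channels),
                nn.Dropout(dropout),
            ]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        # Bi-LSTM layer; hidden size is an assumption.
        self.lstm = nn.LSTM(channels, lstm_hidden, batch_first=True,
                            bidirectional=True)
        # Linear dense layer producing one duration value per phoneme.
        self.proj = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, seq_len) integer phoneme indices
        x = self.embedding(phoneme_ids)      # (B, T, emb_dim)
        x = self.convs(x.transpose(1, 2))    # (B, channels, T)
        x, _ = self.lstm(x.transpose(1, 2))  # (B, T, 2 * lstm_hidden)
        return self.proj(x).squeeze(-1)      # (B, T) predicted durations


model = DurationModel()
durations = model(torch.randint(0, 100, (2, 17)))
print(durations.shape)  # one predicted duration per phoneme position
```

At inference time, a model of this shape maps the phoneme sequence to per-phoneme durations, which the non-attentive TTS acoustic model then uses to expand phoneme encodings to frame level before mel-spectrogram prediction.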
“…In parallel, we train a duration model which will predict at inference time the duration of each phoneme given the phoneme sequence. The duration model as in [21,27] consists of a stack of 3 convolution layers with 512 channels, kernel size of 5 and a dropout of 30%, a Bi-LSTM layer and a linear dense layer. To produce speech, we vocode the mel-spectrograms frame using a universal vocoder [28].…”
Section: Non-attentive Tts Architecturementioning
confidence: 99%