Interspeech 2022
DOI: 10.21437/interspeech.2022-367
CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

Cited by 2 publications (1 citation statement); references 0 publications.
“…The objective of this paper is to understand how we can provide "narrow focus" word-level emphasis controllability for multiple voices and languages (1) without quality degradation, (2) without annotation, (3) without recordings, and (4), if possible, without model re-training. While the context awareness of TTS systems has vastly improved (see [3], [4] among others), automated output does not always assign the correct intonation to cases like (1e), given the preceding context. Several commercial TTS systems thus allow users to tweak the automated output by manually assigning emphasis (which we use as an umbrella term for narrow or contrastive focus) to a selected word.…”
Section: Introduction
confidence: 99%