Interspeech 2022
DOI: 10.21437/interspeech.2022-384
Expressive, Variable, and Controllable Duration Modelling in TTS

Cited by 4 publications (2 citation statements); references 0 publications.
“…Bidirectional encoder representations from transformers (BERT) [17], one of the well-known pre-trained language models currently, also shows potential for this task. For example, Futamata et al have introduced features from pre-trained BERT in Japanese phrase break prediction [16], and Abbas et al have taken word-level BERT embeddings as the input of a conventional phrasing model [18].…”
Section: Introduction (confidence: 99%)
“…Recent developments in TTS research allow for explicit control of specific speech features (e.g. duration [15], [16], duration and pitch [17], etc. ), thus providing the right tools to explicitly control acoustic features associated with emphasis in a voice-agnostic fashion, with no need for targeted recordings or annotations.…”
Section: Introduction (confidence: 99%)