Exploring Transfer Learning for Low Resource Emotional TTS

Tits, Noé; Haddad, Kevin El; Dutoit, Thierry

doi:10.1007/978-3-030-29516-5_5

Cited by 25 publications

(27 citation statements)

References 17 publications


“…The rationale behind this idea is that fine-tuning guides the model to focus on the space that matters the most. Unlike many existing low-resource TTS fine-tuning techniques [7,10,13], the target data is here already present in the so called pre-training step, making our fine-tuning step more of a refinement step.…”

Section: Fine-tuning (mentioning)

confidence: 99%


Low-Resource Expressive Text-To-Speech Using Data Augmentation

Huybrechts

Merritt

Comini

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such recordings. First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers. Next, we use that synthetic data on top of the available recordings to train a TTS model. Finally, we fine-tune that model to further increase quality. Our evaluations show that the proposed changes bring significant improvements over non-augmented models across many perceived aspects of synthesised speech. We demonstrate the proposed approach on 2 styles (newscaster and conversational), on various speakers, and on both single and multi-speaker models, illustrating the robustness of our approach.


Section: Fine-tuning (mentioning)

confidence: 99%

“…Research that focuses on low-resource TTS tries to mitigate the effects of limited data via multi-speaker modelling and transfer learning [7][8][9][10][11][12][13]. By transferring knowledge gained from high-resource speakers, the quality of low-resource systems improves.…”

Section: Introduction (mentioning)

confidence: 99%

Low-Resource Expressive Text-To-Speech Using Data Augmentation

Huybrechts

Merritt

Comini

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


“…Its operations should be inspired by the features to model (1D convolution or RNN cells for long-term context, attention mechanism for recursive relationships). It should have a way to control expressiveness either with a categorical representation [23] or a continuous representation [24]. But it is important to take into account that annotations should not be acquired from humans by asking them to give absolute values on subjective concepts, but rather by asking them to compare examples.…”

Section: Summary and Application (mentioning)

confidence: 99%

The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach

Tits

Haddad

Dutoit

2021

Human 4.0 - From Biology to Cybernetic

Self Cite


As part of the Human-Computer Interaction field, Expressive speech synthesis is a very rich domain as it requires knowledge in areas such as machine learning, signal processing, sociology, psychology. In this Chapter, we will focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will have an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric and statistical parametric speech synthesis. Finally, we focus on the last one, with the last techniques modeling Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks as well as Attention Mechanism. The last part of the Chapter intends to assemble the different aspects of the theory and summarize the concepts.


“…In this study [32], we aim to find out whether it is possible to obtain an emotional TTS system by fine-tuning a neutral TTS system with a small emotional speech dataset. We study the impact of this fine-tuning on the intelligibility of generated speech and the subjective perception of the generated speech.…”

Section: A Synthesis With Emotion Adaptation (mentioning)

confidence: 99%

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech - a Deep Learning approach

Tits

2019

2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)


In this project, we aim to build a Text-to-Speech system able to produce speech with a controllable emotional expressiveness. We propose a methodology for solving this problem in three main steps. The first is the collection of emotional speech data. We discuss the various formats of existing datasets and their usability in speech generation. The second step is the development of a system to automatically annotate data with emotion/expressiveness features. We compare several techniques using transfer learning to extract such a representation through other tasks and propose a method to visualize and interpret the correlation between vocal and emotional features. The third step is the development of a deep learning-based system taking text and emotion/expressiveness as input and producing speech as output. We study the impact of fine tuning from a neutral TTS towards an emotional TTS in terms of intelligibility and perception of the emotion.
