Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction

Schnell, Bastian; Garner, Philip N.

doi:10.21437/ssw.2021-11

Cited by 8 publications

(5 citation statements)

References 12 publications


“…• Emovox w/ Attention Weights (proposed): where the attention weight vector obtained from a pre-trained SER is used to represent the intensity [18];…”

Section: Reference Methods and Setups (mentioning)

confidence: 99%

“…For example, in [19], an inter-to-intra distance ratio algorithm is applied to the learnt style tokens for emotional speech synthesis, where an interpolation technique is used to control emotion intensity. In [18], the authors show that a speech emotion recognizer is capable of generating a meaningful intensity representation via attention or saliency. In [77], [78], a relative attribute scheme is introduced to learn the emotion intensity for emotional speech synthesis.…”

Section: Expressive Speech Synthesis With Prosody Style Control (mentioning)

confidence: 99%

“…There are generally two types of methods in the literature for emotion intensity control. One uses auxiliary features such as a state of voiced, unvoiced, and silence (VUS) [17], attention weights or a saliency map [18]. Another manipulates the internal emotion representations through interpolation [19] or scaling [20].…”

Section: Introduction (mentioning)

confidence: 99%


Emotion Intensity and its Control for Emotional Voice Conversion

Zhou, Şişman, Rana, et al. 2023

IEEE Trans. Affective Comput.


Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.


“…more subjective and challenging to model. Some studies use auxiliary features such as a state of voiced, unvoiced and silence (VUS) [86], attention weights or a saliency map [87] to control the emotion intensity. Other studies manipulate the internal emotion representations through interpolation [88], scaling [76] or distance-based quantization [89].…”

Section: Controllable Emotional Speech Synthesis (mentioning)

confidence: 99%

Speech Synthesis With Mixed Emotions

Zhou, Şişman, Rana, et al. 2023

IEEE Trans. Affective Comput.


Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During the training, the framework does not only explicitly characterize emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the proposed framework. To our best knowledge, this research is the first study on modelling, synthesizing and evaluating mixed emotions in speech.


“…To the best of our knowledge, there is no existing deep learning-based SVS model that expresses emotions of varying intensities [9]. In the TTS field, many studies have been conducted to express types of emotions [10,11,12,13,14,15,16], but there are few studies to express the intensity of emotions [17,16].…”

Section: Introduction (mentioning)

confidence: 99%

MuSE-SVS: Multi-Singer Emotional Singing Voice Synthesizer that Controls Emotional Intensity

Kim, Na, Choonghyeon, et al. 2022

Preprint


We propose U-Singer, the first multi-singer emotional singing voice synthesizer that expresses various levels of emotional intensity. During synthesizing singing voices according to the lyrics, pitch, and duration of the music score, U-Singer reflects singer characteristics and emotional intensity by adding variances in pitch, energy, and phoneme duration according to singer ID and emotional intensity. Representing all attributes by conditional residual embeddings in a single unified embedding space, U-Singer controls mutually correlated style attributes, minimizing interference. Additionally, we apply emotion embedding interpolation and extrapolation techniques that lead the model to learn a linear embedding space and allow the model to express emotional intensity levels not included in the training data. In experiments, U-Singer synthesized high-fidelity singing voices reflecting the singer ID and emotional intensity. The visualization of the unified embedding space exhibits that U-Singer estimates the correct variations in pitch and energy highly correlated with the singer ID and emotional intensity level. The audio samples are presented at https://u-singer.github.io.
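The abstract above mentions interpolating and extrapolating emotion embeddings to reach intensity levels absent from the training data. The sketch below shows only the general idea of moving linearly between two embedding vectors; it is not the authors' implementation. The function name `scale_emotion_embedding`, the use of a neutral embedding as the starting point, and the three-dimensional example vectors are all assumptions made for illustration.

```python
from typing import List, Sequence


def scale_emotion_embedding(
    neutral: Sequence[float],
    emotion: Sequence[float],
    intensity: float,
) -> List[float]:
    """Move linearly from a neutral embedding toward an emotion embedding.

    intensity = 0 returns the neutral embedding, intensity = 1 returns the
    emotion embedding, values in between interpolate, and values above 1
    extrapolate beyond the emotion embedding.
    """
    if len(neutral) != len(emotion):
        raise ValueError("embeddings must have the same dimension")
    return [n + intensity * (e - n) for n, e in zip(neutral, emotion)]


# Hypothetical 3-dimensional embeddings, chosen only for illustration.
neutral = [0.0, 0.0, 0.0]
happy = [2.0, -4.0, 1.0]

half_happy = scale_emotion_embedding(neutral, happy, 0.5)   # interpolation
extra_happy = scale_emotion_embedding(neutral, happy, 1.5)  # extrapolation
```

Whether such a vector actually yields a perceptually weaker or stronger emotion depends on the model having learned an approximately linear embedding space, which is the property the abstract says the training technique encourages.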
