Speech Synthesis With Mixed Emotions

Zhou, Kun; Şişman, Berrak; Rana, Rajib; Schuller, Björn; Li, Haizhou

doi:10.1109/taffc.2022.3233324

Cited by 14 publications

(13 citation statements)

References 120 publications


“…This would allow a wider representation of emotions, yet still underexplored in the emotional TTS design. To the best of our knowledge, it has only been adapted in a few neural emotional TTS, most notably (Zhou et al, 2022b; Tang et al, 2023), where they managed to simulate intensity-level and mixture of emotions through rank-based method and emotion embedding conditioning, respectively.…”

Section: Introduction (mentioning)

confidence: 99%

NIX-TTS: Lightweight and End-to-End Text-to-Speech Via Module-Wise Distillation

Chevi, Prasojo, Aji et al. 2023

2022 IEEE Spoken Language Technology Workshop (SLT)


We often verbally express emotions in a multifaceted manner: they may vary in intensity and may be expressed not as a single emotion but as a mixture of emotions. This wide spectrum of emotions is well studied in the structural model of emotions, which represents a variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emotional text-to-speech design to simulate a wider spectrum of emotions grounded in the structural model. Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn an emotionally separable prosody embedding as a proxy for emotion. This emotion representation allows the model to simulate: (1) primary emotions, as learned from the training samples; (2) secondary emotions, as a mixture of primary emotions; (3) intensity level, by scaling the emotion embedding; and (4) emotion polarity, by negating the emotion embedding. Through a series of perceptual evaluations, Daisy-TTS demonstrated higher overall emotional speech naturalness and emotion perceivability compared to the baseline.
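The embedding operations named in the abstract above (mixing, scaling, negating) can be illustrated with plain vector arithmetic. The following is a minimal sketch only, not the Daisy-TTS implementation: the function names, the equal-weight default, the 3-dimensional vectors, and the emotion labels are all hypothetical and chosen for illustration.

```python
from typing import List

Vector = List[float]


def mix(a: Vector, b: Vector, weight: float = 0.5) -> Vector:
    """Blend two primary-emotion embeddings (assumed: a weighted average)."""
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def scale(e: Vector, intensity: float) -> Vector:
    """Change intensity by multiplying the embedding by a scalar."""
    return [intensity * x for x in e]


def negate(e: Vector) -> Vector:
    """Flip polarity by negating the embedding."""
    return scale(e, -1.0)


# Hypothetical toy embeddings; real prosody embeddings would be learned.
joy: Vector = [1.0, 0.0, 2.0]
trust: Vector = [0.0, 1.0, 0.0]

secondary = mix(joy, trust)    # a mixture of two primary emotions
stronger = scale(joy, 2.0)     # the same emotion at higher intensity
opposite = negate(joy)         # the same emotion with flipped polarity
```

In an actual system the resulting vector would condition the synthesizer; how that conditioning is done is not described in the abstract and is not assumed here.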


“…One way in which emotions are expressed by individuals in social interactions is via speech signals [1]. In the context of human-machine interaction systems, the generation of spoken dialogue is a fundamental facet of natural interaction between humans and machines [2,3]. More importantly, to improve the naturalness of machine communication, the generation of emotionally expressive speech is required.…”

Section: Introduction (mentioning)

confidence: 99%

“…To overcome this, the circumplex model [11] captures emotional expressions using two continuous and independent dimensions, i.e., arousal (relaxed or passive vs. aroused or activated) and valence (positive vs. negative) [12,13]. In SEC research, efforts have also been made to control the intensity of categorical emotion representations, e.g., using mixed emotion representations [3] or modeling emotion intensity as an auxiliary task [14]. Note that SEC using the dimensional representations directly achieves intensity control, as opposed to an additional effort in the categorical representation case.…”

Section: Introduction (mentioning)

confidence: 99%


End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks

Prabhu, Carbajal, Lehmann‐Willenbrock et al. 2022

Interspeech 2022


Speech emotion conversion aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical information and the speaker's identity. In this work, we specifically focus on in-the-wild emotion conversion where parallel data does not exist, and the problem of disentangling lexical, speaker, and emotion information arises. In this paper, we introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance, and subsequently uses a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion. For better representation and to achieve emotion intensity control, we specifically focus on the arousal dimension of continuous representations, as opposed to performing emotion conversion on categorical representations. We test our methodology on the large in-the-wild MSP-Podcast dataset. Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion. Results further reveal that the methodology better synthesises speech for mid-scale arousal (2 to 6) than for extreme arousal (1 and 7).


“…However, the existing methods in adaptive TTS face significant limitations when applied to emotional speech generation, whose objective is to synthesize speech with the desired emotion, in zero-shot scenarios. Specifically, existing approaches for zero-shot adaptive TTS do not consider emotion-controllable generation, while most existing approaches for emotional TTS [9,10,11,12,13,14] do not take zero-shot scenarios into account. These limitations necessitate the development of new methods that can effectively address both requirements in TTS systems, as depicted in Figure 1.…”

Section: Introduction (mentioning)

confidence: 99%

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Tang, Luo, Zhao et al. 2021

Interspeech 2021


Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.

