2023
DOI: 10.1109/taffc.2022.3233324

Speech Synthesis With Mixed Emotions

Abstract: Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During the trai…

Citations: cited by 14 publications (13 citation statements)
References: 120 publications
“…This would allow a wider representation of emotions, yet still underexplored in the emotional TTS design. To the best of our knowledge, it has only been adapted in a few neural emotional TTS, most notably (Zhou et al, 2022b;Tang et al, 2023), where they managed to simulate intensity-level and mixture of emotions through rank-based method and emotion embedding conditioning, respectively.…”
Section: Introduction (mentioning)
confidence: 99%
“…One way in which emotions are expressed by individuals in social interactions is via speech signals [1]. In the context of human-machine interaction systems, the generation of spoken dialogue is a fundamental facet of natural interaction between humans and machines [2,3]. More importantly, to improve the naturalness of machine communication, the generation of emotionally expressive speech is required.…”
Section: Introduction (mentioning)
confidence: 99%
“…To overcome this, the circumplex model [11] captures emotional expressions using two continuous and independent dimensions, i.e., arousal (relaxed or passive vs. aroused or activated) and valence (positive vs. negative) [12,13]. In SEC research, efforts have also been made to control the intensity of categorical emotion representations, e.g., using mixed emotion representations [3] or modeling emotion intensity as an auxiliary task [14]. Note that SEC using the dimensional representations directly archives intensity control, as opposed to an additional effort in the categorical representation case.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, the existing methods in adaptive TTS face significant limitations when applied to emotional speech generation, whose objective is to synthesize the speech with the desired emotion, in zero-shot scenarios. Specifically, existing approaches for zero-shot adaptive TTS do not consider the emotion-controllable generation, while the most of existing approaches for emotional TTS [9,10,11,12,13,14] do not take zero-shot scenarios into account. These limitations necessitate the development of new methods that can effectively address both requirements in TTS systems, as depicted in Figure 1.…”
Section: Introduction (mentioning)
confidence: 99%