2020
DOI: 10.48550/arxiv.2011.02314
Preprint

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

Abstract: Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines: one for spectrum conversion and another for prosody conversion. We train a…
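The disentangle-then-recompose idea behind the two pipelines can be illustrated with a toy, untrained sketch. It is not the authors' implementation. It assumes, as in conditional VAE-style conversion, that an encoder maps an input frame (spectral features, or a prosody feature such as log F0) to a latent code, and that a decoder rebuilds the frame from that code plus a target emotion code. The class `ToyVAWGANPipeline`, its random linear weights, and the one-hot emotion conditioning are hypothetical. The sketch also omits all training, including the VAE objective and the Wasserstein critic, and shows only the inference path.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for trained parameters (hypothetical)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyVAWGANPipeline:
    """Hypothetical, untrained stand-in for one pipeline (spectrum or prosody)."""

    def __init__(self, feat_dim: int, latent_dim: int, num_emotions: int, seed: int = 0):
        rng = random.Random(seed)
        self.num_emotions = num_emotions
        # Encoder heads: latent mean and log-variance (linear, for illustration only).
        self.w_mu = random_matrix(latent_dim, feat_dim, rng)
        self.w_logvar = random_matrix(latent_dim, feat_dim, rng)
        # Decoder sees [latent code ; one-hot emotion code].
        self.w_dec = random_matrix(feat_dim, latent_dim + num_emotions, rng)
        self.rng = rng

    def encode(self, frame: Vector) -> Tuple[Vector, Vector]:
        """Disentangle: map a frame to a latent distribution (mean, log-variance)."""
        return matvec(self.w_mu, frame), matvec(self.w_logvar, frame)

    def sample(self, mu: Vector, logvar: Vector) -> Vector:
        """Reparameterization: z = mu + exp(0.5 * logvar) * eps, eps ~ N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z: Vector, emotion_id: int) -> Vector:
        """Recompose: rebuild a frame from z and a target emotion code."""
        one_hot = [1.0 if k == emotion_id else 0.0 for k in range(self.num_emotions)]
        return matvec(self.w_dec, z + one_hot)

    def convert(self, frame: Vector, target_emotion: int) -> Vector:
        mu, logvar = self.encode(frame)
        return self.decode(self.sample(mu, logvar), target_emotion)


# One pipeline for spectral frames, a separate one for a 1-D prosody feature.
spectrum = ToyVAWGANPipeline(feat_dim=8, latent_dim=4, num_emotions=3, seed=1)
prosody = ToyVAWGANPipeline(feat_dim=1, latent_dim=2, num_emotions=3, seed=2)
converted_spec = spectrum.convert([0.5] * 8, target_emotion=2)
converted_f0 = prosody.convert([5.0], target_emotion=2)
```

Keeping the two pipelines as separate objects reflects the abstract's split between spectrum and prosody conversion. Only the emotion code passed to `decode` changes between source and target, and that is what "recomposition" amounts to in this toy setting.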


Cited by 1 publication (1 citation statement)
References 52 publications
“…Previous research endeavors have explored various ambient sounds as cited in [17], yet remarkably, the emotional states of individuals within the room have never been integrated into the equation, amplifying the complexity of the challenge at hand, and making it more realistic. Furthermore, the usage of generative models has increased in the study of human emotions and context analysis due to their flexibility and versatility to represent and analyze different types of data present all together in the same audio frame [18]-[24]. Generative models, particularly when integrated with classifiers such as variational autoencoders (VAEs), prove to be highly advantageous for classification tasks in the domain of speech and also with ambient sound.…”
Section: Introduction; type: mentioning (confidence: 99%)