Interspeech 2020
DOI: 10.21437/interspeech.2020-1325

Non-Parallel Emotion Conversion Using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator

Abstract: We introduce a novel method for emotion conversion in speech that does not require parallel training data. Our approach loosely relies on a cycle-GAN schema to minimize the reconstruction error from converting back and forth between emotion pairs. However, unlike the conventional cycle-GAN, our discriminator classifies whether a pair of input real and generated samples corresponds to the desired emotion conversion (e.g., A → B) or to its inverse (B → A). We will show that this setup, which we refer to as a var…
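
The two ideas named in the abstract, a cycle reconstruction error and a discriminator that scores an ordered pair of samples, can be illustrated with a minimal NumPy sketch. This is not the authors' model: the abstract is truncated and gives no architecture or loss details, so the toy generators `g_ab` and `g_ba`, the L1 form of the loss, and the logistic `pair_discriminator` are all assumptions made purely for illustration.

```python
import numpy as np

def cycle_loss(x_a, x_b, g_ab, g_ba):
    """L1 reconstruction error after converting A -> B -> A and B -> A -> B.

    The L1 norm is an assumption; the abstract only says "reconstruction error".
    """
    rec_a = g_ba(g_ab(x_a))
    rec_b = g_ab(g_ba(x_b))
    return float(np.mean(np.abs(rec_a - x_a)) + np.mean(np.abs(rec_b - x_b)))

def pair_discriminator(first, second, w, b=0.0):
    """Hypothetical logistic scorer over an ordered pair of samples.

    Outputs near 1 are read as "this pair is an A -> B conversion",
    outputs near 0 as "this pair is the inverse, B -> A".
    """
    z = np.concatenate([first, second], axis=-1) @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy generators that are exact inverses of each other,
# so the cycle loss should vanish up to floating-point error.
g_ab = lambda x: 2.0 * x + 1.0
g_ba = lambda y: (y - 1.0) / 2.0

rng = np.random.default_rng(0)
x_a = rng.normal(size=(4, 3))   # stand-in features for emotion A
x_b = rng.normal(size=(4, 3))   # stand-in features for emotion B

loss = cycle_loss(x_a, x_b, g_ab, g_ba)

# The discriminator sees (real, generated) in one order or the other.
w = rng.normal(size=6)          # untrained, random weights
p_forward = pair_discriminator(x_a, g_ab(x_a), w)
p_inverse = pair_discriminator(g_ab(x_a), x_a, w)
```

The point of the sketch is only the input signature: the discriminator receives a pair and judges the direction of conversion, rather than judging a single sample as real or fake as in a conventional cycle-GAN.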

Cited by 12 publications (7 citation statements)
References: 17 publications
“…Thus, in this work we modify both the F0 contour, associated with the excitation source characteristics, along with the spectral envelope, associated with the vocal tract system characteristics. Recently [30,31], convolutional network based cycleGANs were considered for modeling the F0 values. In [30], a convolutional network based cycleGAN was considered for modeling the F0 values represented using a continuous wavelet transform (CWT) for emotion conversion.…”
Section: Related Work (mentioning)
confidence: 99%
“…In [30], a convolutional network based cycleGAN was considered for modeling the F0 values represented using a continuous wavelet transform (CWT) for emotion conversion. In [31], the generator network consisted of a convolutional neural network followed by a deterministic block with static parameters for modeling the F0 contour. We introduce auto-regressive generator networks to better model the multi-resolution temporal coherence of the F0 contour.…”
Section: Related Work (mentioning)
confidence: 99%
“…There have been studies on deep learning approaches for emotional voice conversion that do not require parallel training data, such as cycle-consistent adversarial network (CycleGAN)-based [17,18] and autoencoder-based frameworks [19,20]. However, they are typically designed for a fixed set of conversion pairs.…”
Section: Introduction (mentioning)
confidence: 99%
“…Inspired by the success in speaker voice conversion, these methods are adopted to model both spectral and prosodic parameters for emotional voice conversion. Successful attempts include GMM [11], sparse representation [12], deep bi-directional long-short-term memory (BLSTM) network [13], GAN-based [14][15][16][17] and autoencoder-based [18][19][20][21] methods. These frameworks model the mapping on a frame-by-frame basis.…”
Section: Introduction (mentioning)
confidence: 99%