Interspeech 2019
DOI: 10.21437/interspeech.2019-2512

A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective

Abstract: We introduce a new model for emotion conversion in speech based on highway neural networks. Our model uses the contextual pitch, energy and spectral information of a source emotional utterance to predict the framewise fundamental frequency and signal intensity under a target emotion. We also incorporate a latent gender representation to promote cross-speaker generalizability. Our neural network is trained to maximize the error log-likelihood under an assumed Laplacian distribution. We validate our model on the…
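
The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the layer equations, so the code assumes the standard highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a tanh transform and a sigmoid gate, and a zero-mean Laplacian error model with a fixed scale. The function names `highway_layer` and `laplace_nll` and all parameters are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used for the highway transform gate."""
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer on a vector x (assumed standard form).

    H(x) = tanh(W_h x + b_h) is the transform, T(x) = sigmoid(W_t x + b_t)
    is the gate, and the output is T * H + (1 - T) * x, so a closed gate
    passes the input through unchanged.
    """
    out = []
    for i in range(len(x)):
        h = math.tanh(sum(w * xj for w, xj in zip(w_h[i], x)) + b_h[i])
        t = sigmoid(sum(w * xj for w, xj in zip(w_t[i], x)) + b_t[i])
        out.append(t * h + (1.0 - t) * x[i])
    return out


def laplace_nll(errors, scale=1.0):
    """Mean negative log-likelihood of prediction errors under a
    zero-mean Laplace density: -log p(e) = log(2 * scale) + |e| / scale.
    The fixed scale is an assumption of this sketch.
    """
    return sum(math.log(2.0 * scale) + abs(e) / scale for e in errors) / len(errors)
```

With a fixed scale, maximizing this Laplacian log-likelihood is equivalent to minimizing the mean absolute error between predicted and target values, which makes the objective less sensitive to outliers than a squared-error (Gaussian) loss.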

Cited by 16 publications (9 citation statements)
References 17 publications (22 reference statements)

“…Studies have also revealed that the emotions can be expressed through universal principles that are shared across different individuals and cultures (Ekman, 1992; Manokara et al, 2021). This motivates the study of multispeaker (Shankar et al, 2019b, 2020), and speaker-independent emotional voice conversion (Zhou et al, 2020b; Choi and Hahn, 2021).…”
Section: Related Work Speech Emotion Conversion
Type: mentioning
confidence: 98%
“…Moreover, there was also a study to combine hidden Markov model (HMM), GMM, and F0 segment selection [36] for spectrum and prosody conversion. Recent deep learning methods, such as deep belief network (DBN) [37], deep bidirectional long-short-term memory (DBLSTM) [38], highway neural network [10,39], sequence-to-sequence [40] and rule-based model [41] have achieved remarkable performance on emotion conversion. We note that the prior studies of emotion conversion do not provide an in-depth investigation of the disentanglement of emotional elements in speech, which will be the focus of this paper.…”
Section: Speaker a (Happy)
Type: mentioning
confidence: 99%
“…In other words, they usually fail on multi-speaker datasets because features from different emotions tend to overlap considerably among various speakers. In an attempt to solve this, multi-speaker EVC systems [25], [26] that correspondingly modified the pitch and the energy, using a highway network [27] and a convolutional GAN network [28], were introduced. One of the recent studies [29] demonstrated variational autoencoding Wasserstein generative adversarial network (VAWGAN) based EVC utilizing the continuous wavelet transform (CWT) decomposition of F0 which allows the network to learn the speaker-independent emotion pattern across different speakers.…”
Section: Introduction
Type: mentioning
confidence: 99%