ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053581
Multi-Conditioning and Data Augmentation Using Generative Noise Model for Speech Emotion Recognition in Noisy Conditions

Cited by 28 publications (20 citation statements)
References 14 publications
“…The linear projection layer predicts the emotion class possibility from the utterance-level emotional features. We perform data augmentation by adding white Gaussian noise to improve the robustness of SER ( [122], [123], [124], [125]).…”
Section: Speech Emotion Recognizer
confidence: 99%
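The augmentation described in this excerpt — adding white Gaussian noise to improve SER robustness — can be sketched as follows. This is a minimal illustration, not the citing paper's implementation: the function name and the SNR-based scaling are assumptions.

```python
import numpy as np

def add_white_noise(waveform: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a target SNR in dB.

    Hypothetical helper for illustration: scales the noise power so the
    resulting signal-to-noise ratio matches snr_db.
    """
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```

In practice the target SNR is typically drawn at random per utterance so the model sees a range of noise levels during training.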
“…A black dot (•) in a cell means the corresponding database was used in the research mentioned at the bottom of the column. [Flattened table: databases used per research work, years 2005–2020; methods surveyed include HMM, SVM [6], [17], GerDA, RBM [22], LSTM/BLSTM [28], CRF, CRBM [24], SVM with PCA/LPP/TSL [90], DNN, ANN, ELM [23], DCNN variants [21], [25], [26], [29], [30], [79], LSTM with MTL [33], ANN, PSOF [19], VAE/DAE/AAE/AVB [31], [32], GAN-based models [86], [88], [89], attention-based models [83], [94], [95], LDA, TSL, TLSL [91], and DNN, Generative [76].] Additionally, Figure 2a compares the accuracies reported by deep learning methods on EMO-DB versus IEMOCAP, showing a clear separation between the published accuracies. Again, one reason could be that EMO-DB has an order of magnitude fewer samples than IEMOCAP, which makes deep learning methods trained on it more prone to overfitting.…”
Section: Discussion
confidence: 99%
“…Lately, Tiwari et al. [76] address the noise robustness of SER in the presence of additive noise by employing an utterance-level parametric generative noise model. Their deep neural network framework is useful for defeating unseen noise since the generated noise can cover the entire noise space in the Mel filter bank energy domain.…”
Section: Emotion Recognition Methods
confidence: 99%
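The excerpt above notes that the cited method generates noise in the Mel filter-bank energy domain. As an illustrative sketch only — the per-band Gaussian parametrization and function name below are assumptions, not the authors' generative model — corrupting log Mel energies amounts to sampling a noise level per band and summing with the clean energies in the linear domain, where additive noise actually combines:

```python
import numpy as np

def augment_mel_energies(clean_fbe: np.ndarray, rng: np.random.Generator,
                         noise_mean: float = -5.0,
                         noise_std: float = 1.0) -> np.ndarray:
    """Corrupt log Mel filter-bank energies with a sampled noise floor.

    clean_fbe: array of shape (frames, bands) holding natural-log energies.
    A per-band log noise level is drawn from a Gaussian (a hypothetical
    parametrization for illustration); clean and noise energies are then
    added in the linear domain and mapped back to the log domain.
    """
    noise_log = rng.normal(noise_mean, noise_std, size=clean_fbe.shape[1])
    return np.log(np.exp(clean_fbe) + np.exp(noise_log))
```

Because the noise energy is strictly positive, the augmented log energies can only move upward from the clean ones, mimicking how an additive noise floor masks low-energy regions of the spectrum.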