Speech emotion recognition using data augmentation method by cycle-generative adversarial networks

Shilandari, Arash; Marvi, Hossein; Khosravi, Hossein; Wang, Wenwu

doi:10.1007/s11760-022-02156-9

Cited by 19 publications

(12 citation statements)

References 31 publications


“…They found that data augmentation is very helpful in building speech recognition systems. Other articles have also been published on improving emotion recognition rate using gender classification [15], extracting resistant speech features [16], and data augmentation using Cycle-Gans [17].…”

Section: The Related Studies (mentioning)

confidence: 99%

Effective Feature Selection in Speech Emotion Recognition Systems using Generative Adversarial Networks

Shilandari

Marvi

Hadjiabdolhamid

2023

Preprint


Thus far, it has been unknown whether feature selection methods succeed in increasing the efficiency of speech emotion recognition systems. This article discusses and evaluates feature selection for data augmentation purposes in a speech emotion recognition system. The experiments were performed in Python on four common databases: EMODB, eNTERFACE05, SAVEE, and IEMOCAP. Data analysis was conducted on all four databases for five emotions: sadness, fear, anger, happiness, and neutral. A support vector machine was used to classify emotions. We also used a generative adversarial network to augment data and two feature selection methods, the Fisher and Linear Discriminant Analysis algorithms. In two steps, and with feedback from the classification network, we could bring the speech emotion recognition system to an optimal point in sample number and feature vector dimensions. The results showed that using Linear Discriminant Analysis and the Fisher method simultaneously in the generative adversarial networks can remove redundant and irrelevant features while preserving features with important emotional information for classification. The results obtained from the proposed method were compared with those of recent studies. The proposed method achieved 86.32% accuracy on the Berlin Database of Emotional Speech.
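The abstract above names Fisher-based feature selection without spelling it out. The following is a minimal sketch of Fisher-score ranking under the standard textbook definition (between-class scatter divided by within-class scatter, per feature). It is not the authors' implementation: the function names `fisher_scores` and `select_top_k` and the toy feature values are hypothetical, and the GAN augmentation, LDA, and SVM stages described in the abstract are omitted.

```python
from statistics import fmean


def fisher_scores(samples, labels):
    """Return one Fisher score per feature column.

    Assumed (standard) definition, not taken from the paper:
    score_j = sum_c n_c * (mean_cj - mean_j)^2 / sum_c n_c * var_cj
    """
    n_features = len(samples[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        overall_mean = fmean(column)
        between = 0.0  # between-class scatter of feature j
        within = 0.0   # within-class scatter of feature j
        for c in classes:
            values = [row[j] for row, y in zip(samples, labels) if y == c]
            class_mean = fmean(values)
            class_var = fmean([(v - class_mean) ** 2 for v in values])
            between += len(values) * (class_mean - overall_mean) ** 2
            within += len(values) * class_var
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class spread: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(samples, labels, k):
    """Keep the k features with the highest Fisher score."""
    scores = fisher_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in samples]
    return keep, reduced


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
samples = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0],
           [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
labels = ["sadness"] * 3 + ["anger"] * 3
keep, reduced = select_top_k(samples, labels, k=1)
print(keep)
```

In a pipeline like the one the abstract describes, the reduced feature vectors would then be passed to a classifier such as a support vector machine; that step is left out here to keep the sketch dependency-free.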


“…However, non-basic emotions account for the majority of emotion manifestations in human-to-human communication. Furthermore, the majority of existing emotion recognition systems are unimodal: the system only processes speech data or face images [31]. In recent years, multimodal affect analysis has received a lot of attention, however, a very limited research has been done to exploit the audio-visual cues for emotion recognition tasks.…”

Section: Related Work a Unimodal Emotion Recognition (mentioning)

confidence: 99%

Deep Learning for Audio Visual Emotion Recognition

Hussain

Wang

Bouaynaya

et al. 2022

2022 25th International Conference on Information Fusion (FUSION)


Human emotions can be presented in data with multiple modalities, e.g. video, audio and text. An automated system for emotion recognition needs to consider a number of challenging issues, including feature extraction, and dealing with variations and noise in data. Deep learning has been extensively used recently, offering excellent performance in emotion recognition. This work presents a new method based on audio and visual modalities, where visual cues facilitate the detection of the speech or non-speech frames and the emotional state of the speaker. Different from previous works, we propose the use of novel speech features, e.g. the Wavegram, which is extracted with a one-dimensional Convolutional Neural Network (CNN) learned directly from time-domain waveforms, and Wavegram-Logmel features, which combine the Wavegram with the log mel spectrogram. The system is then trained in an end-to-end fashion on the SAVEE database by also taking advantage of the correlations among each of the streams. It is shown that the proposed approach outperforms the traditional and state-of-the-art deep learning based approaches, built separately on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions.


“…Also, various data augmentation strategies have been successfully adopted for the same purpose, e.g. [36], [37]. On the other hand, the application of dimensionality reduction transformations to the model's input data is an established strategy for reducing resource demands while limiting the loss of useful information carried by the input data.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Speech Emotion Representations in the Quaternion Domain

Guizzo

Weyde

Scardapane

et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.


The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data, are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo.

