When Old Meets New: Emotion Recognition from Speech Signals

Araño, Keith April; Gloor, Peter A.; Orsenigo, Carlotta; Vercellis, Carlo

doi:10.1007/s12559-021-09865-2

Cited by 22 publications

(3 citation statements)

References 60 publications


“…As illustrated, IMEMD-CRNN consists of three modules: IMEMD-based emotional speech signal decomposition, extraction of time-frequency features from IMFs, and speech emotion recognition based on CRNN. Arano et al (2021) show that effective hand-crafted features, compared to sophisticated deep-learning feature sets, can still have better performance. Therefore, we combine IMEMD-based features with CRNN network in order to improve the robustness and accuracy of the speech emotion recognition system.…”

Section: Methods (mentioning)

confidence: 99%

Speech emotion recognition based on improved masking EMD and convolutional recurrent neural network

Sun

2023

Front. Psychol.


Speech emotion recognition (SER) is the key to human-computer emotion interaction. However, the nonlinear characteristics of speech emotion are variable, complex, and subtly changing. Therefore, accurate recognition of emotions from speech remains a challenge. Empirical mode decomposition (EMD), as an effective decomposition method for nonlinear non-stationary signals, has been successfully used to analyze emotional speech signals. However, the mode mixing problem of EMD affects the performance of EMD-based methods for SER. Various improved methods for EMD have been proposed to alleviate the mode mixing problem. These improved methods still suffer from the problems of mode mixing, residual noise, and long computation time, and their main parameters cannot be set adaptively. To overcome these problems, we propose a novel SER framework, named IMEMD-CRNN, based on the combination of an improved version of the masking signal-based EMD (IMEMD) and a convolutional recurrent neural network (CRNN). First, IMEMD is proposed to decompose speech. IMEMD is a novel disturbance-assisted EMD method and can determine the parameters of masking signals according to the nature of the signals. Second, we extract the 43-dimensional time-frequency features that can characterize the emotion from the intrinsic mode functions (IMFs) obtained by IMEMD. Finally, we input these features into a CRNN network to recognize emotions. In the CRNN, 2D convolutional neural network (CNN) layers are used to capture nonlinear local temporal and frequency information of the emotional speech. Bidirectional gated recurrent unit (BiGRU) layers are used to learn the temporal context information further. Experiments on the publicly available TESS dataset and Emo-DB dataset demonstrate the effectiveness of our proposed IMEMD-CRNN framework. The TESS dataset consists of 2,800 utterances containing seven emotions recorded by two native English speakers.
The Emo-DB dataset consists of 535 utterances containing seven emotions recorded by ten native German speakers. The proposed IMEMD-CRNN framework achieves a state-of-the-art overall accuracy of 100% for the TESS dataset over seven emotions and 93.54% for the Emo-DB dataset over seven emotions. The IMEMD alleviates the mode mixing and obtains IMFs with less noise and more physical meaning with significantly improved efficiency. Our IMEMD-CRNN framework significantly improves the performance of emotion recognition.


“…The expression by different people is not exactly the same, so it is difficult to obtain a unified and recognized emotion description in the field of speech emotion recognition [11]. Most of the current studies used for speech emotion recognition are based on the discrete emotions.…”

Section: Discrete Model For Speech Emotion (mentioning)

confidence: 99%

Speech emotion recognition based on ResNet-BiGRU network

Fu, Xu, Yuan

2023

Fourth International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2023)


In order to improve the anthropomorphic nature of intelligent speech products, academic research on speech emotion recognition is attracting growing interest. Currently, a speech emotion recognition system mainly consists of two steps: speech feature extraction and speech feature classification. In order to improve the accuracy of speech emotion recognition, the Mel Frequency Cepstrum Coefficient (MFCC) of the speech signal, which currently performs well as a feature in the speech field, is chosen as the input of the deep learning network, and a ResNet-BiGRU network based on the attention mechanism is used to extract the MFCC information. The experimental results show that introducing the attention mechanism into the model can effectively focus on useful information and reduce the interference of redundant information. The accuracy rate on the Chinese sentiment corpus CASIA reached 84.83%.
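The abstract above does not say which form of attention the network uses. As a rough illustration of how attention can emphasize informative frames, the following minimal sketch applies dot-product attention pooling to a sequence of frame-level feature vectors (for instance, recurrent-layer outputs). The function names, the `query` vector, and the toy values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weight each frame by softmax(dot(frame, query)) and sum the frames.

    `query` stands in for a learned parameter vector (an assumption here).
    Returns the pooled vector and the per-frame attention weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three frames with two features each.
frames = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]
pooled, weights = attention_pool(frames, query)
```

In this toy input the second frame has the largest score, so it receives the largest weight and dominates the pooled vector, while the low-scoring frames contribute little.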


“…The classification task in machine learning is normally performed using a single classifier, hierarchical classifier, or classifier ensemble approach. Araño et al [6] utilized a hybrid set of features for classifying emotions from speech consisting of MFCCs and image features extracted from spectrograms. The MFCC features along with the long short-term memory (LSTM) network performed better as compared to the SVM classifier.…”

Section: Introduction (mentioning)

confidence: 99%

Audio Based Emotion Classification Using Classifier Ensemble

Mudassar, Ul Haq, Majid et al.

2023

PJETS


This paper presents a novel approach of combining classifiers outputs for audio emotion recognition. The proposed classifiers ensemble technique combines the confusion matrices of base classifiers. It is because some classifiers with overall lower performance have better accuracy for a specific class as compared to others with overall higher accuracy. In this approach, the best results obtained for different emotion classes from various classifiers are combined to create a combined confusion matrix. The performance of this approach was analyzed using three emotional speech databases in different languages, i.e., Berlin emotional speech database (EMO-DB), Italian emotional speech database (EMOVO-DB), and Surrey audio-visual expressed emotion database (SAVEE-DB). The openSMILE toolkit was used to extract a total of 8543 audio features. These features include pitch, energy, intensity, jitter, shimmer, formants, MFCC, MFB, LSP and spectral features. These features were normalized using min-max normalization technique, while correlation-based feature selection (CFS) with best-first search approach was used for feature reduction. The classification was performed using five different base classifiers, i.e., SVM, MLP, IBK, AdaBoost, and Random Forest. The experimental results showed better performance for the proposed technique as compared to other state-of-the-art methods. The classification accuracies obtained for seven emotion classes were 91.8%, 83.7%, and 80.5% for the EMO-DB, EMOVO-DB, and SAVEE-DB, respectively.
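The combination step described in this abstract can be sketched as follows. This is a minimal illustration, assuming that "best result" for a class means the highest per-class recall and that the combined matrix copies that class's row from the winning base classifier. The function names and the toy matrices are hypothetical, and the paper's exact selection rule may differ.

```python
def per_class_recall(cm):
    """Recall per class for a confusion matrix whose rows are true classes."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def combine_confusion_matrices(matrices):
    """For each true class, copy the row from the base classifier with the
    highest recall on that class (assumed rule; ties go to the earliest)."""
    n_classes = len(matrices[0])
    recalls = [per_class_recall(cm) for cm in matrices]
    combined = []
    for c in range(n_classes):
        best = max(range(len(matrices)), key=lambda k: recalls[k][c])
        combined.append(list(matrices[best][c]))
    return combined


def accuracy(cm):
    """Overall accuracy: diagonal count divided by total count."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Toy 3-class example with two hypothetical base classifiers.
cm_a = [[8, 1, 1], [3, 5, 2], [1, 1, 8]]
cm_b = [[6, 2, 2], [1, 8, 1], [2, 2, 6]]
combined = combine_confusion_matrices([cm_a, cm_b])
```

Here classifier A wins classes 0 and 2 and classifier B wins class 1, so the combined matrix reaches a higher overall accuracy than either base matrix. Because this sketch selects and scores on the same matrices, the figure it produces describes the toy data only.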

