Speech Emotion Recognition Using Convolutional-Recurrent Neural Networks with Attention Model

Mu, Yawei; Gómez, Luis A. Hernández; Montes, Agustín; Martínez, Carlos Alcaraz; Wang, Xuetian

doi:10.12783/dtcse/cii2017/17273

Cited by 11 publications

(7 citation statements)

References 14 publications

(13 reference statements)


“…WA is the overall accuracy, calculated as the ratio of the total number of test data and the number of samples accurately predicted by the actual label. UA is calculated as the average of the recall values of four classes and is an important performance indicator in the evaluation of the SER model based on imbalanced datasets [19, 20, 26].…”

Section: Discussion (mentioning)

confidence: 99%
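
The two metrics described in the statement above can be sketched in a few lines. This is a minimal illustration, not code from the citing paper: WA is taken as correctly predicted samples divided by all test samples, and UA as the mean of the per-class recalls. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: correctly predicted samples divided by all test samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of the per-class recall values, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy labels (four classes).
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad"] + ["angry"]
y_pred = ["neutral"] * 6 + ["happy", "neutral"] + ["neutral"] + ["angry"]

wa = weighted_accuracy(y_true, y_pred)    # 8 of 10 samples are correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls: 1.0, 0.5, 0.0, 1.0
print(f"WA={wa:.3f} UA={ua:.3f}")
```

On this toy data WA is 0.8 while UA is 0.625. The majority class inflates WA, which is why UA is the more informative figure on imbalanced datasets.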

“…Recent SER models based on deep-learning architectures [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] have demonstrated state-of-the-art performance with an attention mechanism [19, 20, 22, 23, 25, 26]. The deep-learning architectures adopted in previous studies included recurrent neural networks (RNN) [19], convolutional neural networks (CNN) [24], and convolutional RNNs (CRNN) [20, 26]. Liu et al. [21] presented an SER model of a decision tree for an extreme learning machine having a single hidden-layer feed-forward neural network, using a mixture of deep learning and typical classification techniques.…”

Section: Related Work (mentioning)

confidence: 99%



Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Noh, Jeong, Lim, et al. 2021. Sensors.

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.


“…The authors Yawei Mu et al. [10] proposed a new method using distributed Convolution Neural Networks (CNN) [11] to automatically learn affect-salient features from raw spectral information, and then applying Bidirectional Recurrent Neural Network (BRNN) to obtain the temporal information from the output of CNN, but even with this method accuracy achieved is 64.08%.…”

Section: CNN Classifier [5] (mentioning)

confidence: 99%

Voice Emotion Recognition using CNN and Decision Tree

Damodar, Vani, Anusuya. 2019. IJITEE.

This paper presents the use of decision tree (DT) and CNN classifiers to classify emotions from English and Kannada audio data. Both CNN and DT show potential for various emotions. A comparative study of the classifiers using various parameters is presented. CNN has been identified as the better classifier for emotion recognition: emotions are recognized with 72% and 63% accuracy using the CNN and decision tree algorithms, respectively. MFCC features are extracted from the audio signals, and the model is trained, tested, and evaluated while varying the parameters. Speech emotion recognition systems are useful in psychiatric diagnosis, lie detection, call centre conversations, customer voice review, and voice messages.


“…The experimental results showed the high performance of the proposed method in IEMOCAP (Busso et al., 2008) and CHEAVD (Li et al., 2017) dataset. Mu et al. (2017) used distributed convolutional neural network (CNN) to automatically learn the emotion features from the raw speech spectrum, and they used bidirectional BRNN to obtain the time information from the CNN output. Finally, the output sequence of BRNN was weighted by attention mechanism algorithm to focus on the useful part of emotion.…”

Section: Introduction (mentioning)

confidence: 99%
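
The last step in the statement above, weighting the BRNN output sequence by attention, can be illustrated with a small sketch. The statement does not say how Mu et al. compute the attention scores, so the dot-product scoring against a single learned query vector used here is an assumption, and `attention_pool`, `frames`, and `query` are hypothetical names. Plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame vectors into one utterance vector.

    frames: list of T vectors, e.g. per-frame BRNN outputs.
    query:  hypothetical learned vector of the same dimension (assumed scoring).
    """
    # One scalar relevance score per frame (dot product with the query).
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of the frames: frames with higher weights dominate.
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy sequence of three 2-dimensional frame outputs.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
pooled, weights = attention_pool(frames, query)
print(weights, pooled)
```

Here the middle frame receives most of the weight, so the pooled vector is dominated by it. In a trained model the query (or a small scoring network) would be learned jointly with the CNN and BRNN.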

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

et al. 2021


Speech emotion recognition (SER) is a challenging task because of the affective variance between speakers. SER performance relies heavily on the features extracted from speech signals, and building an effective feature extraction and classification model remains difficult. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on the ImageNet dataset is applied to generate segment-level features. We stack the features of a sentence into utterance-level features. Next, we adopt the BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which is better than most popular SER methods and demonstrates the effectiveness of our proposed method.
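
The three-channel input mentioned in the abstract above (static, delta, delta-delta) can be sketched as follows. The abstract does not state how the deltas are computed, so the common regression formula with a window of N=2 frames and edge padding is assumed here. The `delta` helper and the toy `static` sequence are hypothetical, and real log Mel-spectrogram frames would replace the ramp.

```python
def delta(frames, N=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of T feature vectors (one per time step).
    N:      half-width of the regression window (assumed value).
    """
    T = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(dim):
            num = 0.0
            for n in range(1, N + 1):
                nxt = frames[min(t + n, T - 1)][d]  # repeat last frame at the end
                prv = frames[max(t - n, 0)][d]      # repeat first frame at the start
                num += n * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


# Toy "static" channel: a linear ramp with one feature per frame.
static = [[float(t)] for t in range(6)]
d1 = delta(static)  # first-order delta
d2 = delta(d1)      # delta-delta
channels = [static, d1, d2]  # stacked like the three channels of an image
print(d1)
```

Away from the edges, the delta of a linear ramp is its slope (1.0 here). Stacking the three sequences gives an image-like input, which is what allows a network pre-trained on ImageNet to be reused.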
