2018
DOI: 10.1109/lsp.2018.2860246
3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition

Cited by 401 publications (252 citation statements)
References 9 publications
“…Similarly, Zhao et al. [22] implemented an attention layer right after the RNNs to extract the most salient acoustic parts of the continuum. Apart from RNNs and DNNs, the attention layer was also integrated with CNNs [23,24]. All these works, nevertheless, were conducted using traditional hand-crafted features, and did not explicitly investigate the differences of attention in an MTL framework.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…They trained different models with different features to increase the accuracy up to 71%, but they used the same architecture that is used for computer-vision tasks. Chen et al. [42] developed an SER system using a 3D CNN architecture and trained the model to improve SER accuracy, but they also relied on a pooling scheme to build the network. Due to this limitation, we explored a plain CNN architecture to propose a new SER model that outperforms state-of-the-art results.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
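
The 3-D convolutional front end referenced in this excerpt can be illustrated with a minimal PyTorch sketch. It assumes a 3-D input that stacks static, delta, and delta-delta log-Mel features along a depth axis; the layer sizes and pooling configuration here are illustrative assumptions, not the cited authors' exact architecture.

import torch
import torch.nn as nn

class Conv3dBlock(nn.Module):
    def __init__(self, in_ch=1, out_ch=32):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(out_ch)
        # Pool only over time and frequency; keep the 3-step feature depth.
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))

    def forward(self, x):
        # x: (batch, 1, depth=3, time, mel); depth holds the assumed
        # static/delta/delta-delta stack.
        return self.pool(torch.relu(self.bn(self.conv(x))))

x = torch.randn(4, 1, 3, 300, 40)   # 4 utterances, 300 frames, 40 Mel bands
print(Conv3dBlock()(x).shape)       # torch.Size([4, 32, 3, 150, 20])

The max-pooling here is the kind of scheme the excerpt criticizes; a "plain CNN" variant would simply drop the pooling layer and rely on strided convolutions or none at all.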
“…As can be seen in Figure 3, an attention layer has been added after the LSTM layer to score the importance of the sequence of high-level features to the final decision [36].…”
Section: Attention Layer (citation type: mentioning)
confidence: 99%
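
A minimal sketch of the attention pooling this excerpt describes, assuming a simple learned scoring vector over the per-frame LSTM outputs. This is one common parameterization; the exact form used in the citing paper's Figure 3 may differ.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # One learned score per frame; softmax turns scores into weights.
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h):
        # h: (batch, time, hidden) -- per-frame LSTM outputs
        alpha = torch.softmax(self.score(h), dim=1)  # (batch, time, 1)
        return (alpha * h).sum(dim=1)                # weighted sum over time

lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
attn = AttentionPooling(64)
h, _ = lstm(torch.randn(4, 300, 128))  # 4 utterances, 300 frames each
print(attn(h).shape)                   # torch.Size([4, 64])

The pooled vector then feeds the final emotion classifier, so frames with higher attention weights contribute more to the decision.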