Interspeech 2019
DOI: 10.21437/interspeech.2019-2822
Self-Attention for Speech Emotion Recognition

Abstract: Speech Emotion Recognition (SER) has been shown to benefit from many of the recent advances in deep learning, including recurrent and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. In this work, we investigate whether SER could benefit from the self-attention and global windowing of the transformer model. We show on the IEMOCAP database that this is indeed the case. Finally, we investigate whether using the distribution of, possibly co…
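The approach the abstract describes, self-attention applied globally over frame-level acoustic features, can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: the feature dimension, model sizes, and the four-class setup common in IEMOCAP evaluations are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class SelfAttentionSER(nn.Module):
    """Minimal transformer-encoder classifier over frame-level features.

    Assumed setup: utterances as (batch, frames, feat_dim) tensors of
    low-level descriptors; four emotion classes as in common IEMOCAP
    evaluations. Hyperparameters are illustrative, not the paper's.
    """
    def __init__(self, feat_dim=40, d_model=128, n_heads=4,
                 n_layers=2, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x, pad_mask=None):
        # x: (batch, frames, feat_dim); pad_mask: (batch, frames), True = pad.
        # Every frame attends to every other frame (global window).
        h = self.encoder(self.proj(x), src_key_padding_mask=pad_mask)
        # Mean-pool over valid frames, then classify the utterance.
        if pad_mask is not None:
            valid = (~pad_mask).unsqueeze(-1).float()
            h = (h * valid).sum(1) / valid.sum(1).clamp(min=1.0)
        else:
            h = h.mean(1)
        return self.classifier(h)

model = SelfAttentionSER()
logits = model(torch.randn(8, 300, 40))  # 8 utterances, 300 frames each
```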

Cited by 97 publications (62 citation statements). References 18 publications.
“…Neumann and Vu [33] proposed an attentive convolutional neural network (ACNN) to test the emotional discrimination of different feature sets. In addition, self-attention-based deep models [34], [35] demonstrated their effectiveness in improving SER performance. Unlike these studies, we apply a temporal attention model to the sliding-window sequence instead of applying one based on LLDs.…”
Section: Temporal Attention Model (mentioning)
confidence: 99%
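A temporal attention model over a sliding-window sequence, as described in the statement above, might look like the following additive-attention pooling sketch; the dimensions and the scoring network are illustrative assumptions, not the cited paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Illustrative additive-attention pooling over a sequence of
    sliding-window segment embeddings (not the cited paper's model)."""
    def __init__(self, dim=128):
        super().__init__()
        # Small MLP scores each window's relevance to the utterance.
        self.score = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, windows):
        # windows: (batch, n_windows, dim) segment-level features.
        alpha = torch.softmax(self.score(windows), dim=1)  # weights sum to 1
        return (alpha * windows).sum(dim=1)  # weighted utterance embedding

pooled = TemporalAttentionPool()(torch.randn(8, 20, 128))  # -> (8, 128)
```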
“…Tarantino et al. [31] used the global windowing method on top of the already extracted frames to express relationships between datapoints, and applied self-attention over 384 low-level features to weight each frame based on its correlations with the other frames. They then classified emotions using a CNN model and achieved a weighted accuracy of 64.33% on IEMOCAP.…”
Section: Related Work (mentioning)
confidence: 99%
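The frame weighting described above, where each frame is weighted by its correlation with the other frames, reads as scaled dot-product self-attention. The following is a minimal sketch under that reading; the 384-dimensional features come from the quoted description, while the batch and frame counts are arbitrary.

```python
import torch
import torch.nn.functional as F

def self_attention_frame_weighting(frames):
    """Weight each frame by its scaled dot-product similarity to all
    other frames. frames: (batch, n_frames, 384) low-level features.
    A sketch of the idea described above, not the authors' exact code."""
    d = frames.size(-1)
    scores = frames @ frames.transpose(1, 2) / d ** 0.5  # (b, n, n)
    attn = F.softmax(scores, dim=-1)                     # row-wise weights
    return attn @ frames                                 # re-weighted frames

weighted = self_attention_frame_weighting(torch.randn(4, 100, 384))
```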
“…This method has a drawback in that classifying emotions can be time-consuming, because the audio file must be analyzed and converted to audio without noise or silence during preprocessing. In the aforementioned studies [29, 30, 31, 32, 33], local correlations between spectral features could be ignored because normalized spectral features from preprocessing were used.…”
Section: Related Work (mentioning)
confidence: 99%
“…Variants of attention-based mechanisms have been proposed that performed significantly better than the previous models [18, 19, 16]. One possible reason why attention models outperform others is that they learn the biases for a specific task, or group of tasks, leading to improved generalisation.…”
Section: Related Work (mentioning)
confidence: 99%