Interspeech 2019
DOI: 10.21437/interspeech.2019-2822
Self-Attention for Speech Emotion Recognition

Abstract: Speech Emotion Recognition (SER) has been shown to benefit from many of the recent advances in deep learning, including recurrent and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. In this work, we investigate whether SER could benefit from the self-attention and global windowing of the transformer model. We show on the IEMOCAP database that this is indeed the case. Finally, we investigate whether using the distribution of, possibly co…
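The approach the abstract describes, self-attention applied globally over frame-level acoustic features, can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: the feature dimension, model sizes, and the four-class setup common in IEMOCAP evaluations are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class SelfAttentionSER(nn.Module):
    """Minimal transformer-encoder classifier over frame-level features.

    Assumed setup: utterances as (batch, frames, feat_dim) tensors of
    low-level descriptors; four emotion classes as in common IEMOCAP
    evaluations. Hyperparameters are illustrative, not the paper's.
    """
    def __init__(self, feat_dim=40, d_model=128, n_heads=4,
                 n_layers=2, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x, pad_mask=None):
        # x: (batch, frames, feat_dim); pad_mask: (batch, frames), True = pad.
        # Every frame attends to every other frame (global window).
        h = self.encoder(self.proj(x), src_key_padding_mask=pad_mask)
        # Mean-pool over valid frames, then classify the utterance.
        if pad_mask is not None:
            valid = (~pad_mask).unsqueeze(-1).float()
            h = (h * valid).sum(1) / valid.sum(1).clamp(min=1.0)
        else:
            h = h.mean(1)
        return self.classifier(h)

model = SelfAttentionSER()
logits = model(torch.randn(8, 300, 40))  # 8 utterances, 300 frames each
```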

Cited by 97 publications (62 citation statements). References 18 publications.
“…Neumann and Vu [33] proposed an attentive convolutional neural network (ACNN) to test the emotional discrimination of different feature sets. In addition, self-attention-based deep models [34], [35] demonstrated their effectiveness in improving SER performance. Unlike these studies, we apply a temporal attention model to the sliding-window sequence instead of applying one based on LLDs.…”
Section: Temporal Attention Model (mentioning)
confidence: 99%
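A temporal attention model over a sliding-window sequence, as described in the statement above, might look like the following additive-attention pooling sketch; the dimensions and the scoring network are illustrative assumptions, not the cited paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Illustrative additive-attention pooling over a sequence of
    sliding-window segment embeddings (not the cited paper's model)."""
    def __init__(self, dim=128):
        super().__init__()
        # Small MLP scores each window's relevance to the utterance.
        self.score = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, windows):
        # windows: (batch, n_windows, dim) segment-level features.
        alpha = torch.softmax(self.score(windows), dim=1)  # weights sum to 1
        return (alpha * windows).sum(dim=1)  # weighted utterance embedding

pooled = TemporalAttentionPool()(torch.randn(8, 20, 128))  # -> (8, 128)
```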
“…Tarantino et al. [31] used the global windowing method on top of the already extracted frames to express relationships between datapoints, and applied self-attention over 384 low-level features to weight each frame based on its correlations with the other frames. They then classified emotions using a CNN model and achieved a weighted accuracy of 64.33% on IEMOCAP.…”
Section: Related Work (mentioning)
confidence: 99%
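The frame weighting described above, where each frame is weighted by its correlation with the other frames, reads as scaled dot-product self-attention. The following is a minimal sketch under that reading; the 384-dimensional features come from the quoted description, while the batch and frame counts are arbitrary.

```python
import torch
import torch.nn.functional as F

def self_attention_frame_weighting(frames):
    """Weight each frame by its scaled dot-product similarity to all
    other frames. frames: (batch, n_frames, 384) low-level features.
    A sketch of the idea described above, not the authors' exact code."""
    d = frames.size(-1)
    scores = frames @ frames.transpose(1, 2) / d ** 0.5  # (b, n, n)
    attn = F.softmax(scores, dim=-1)                     # row-wise weights
    return attn @ frames                                 # re-weighted frames

weighted = self_attention_frame_weighting(torch.randn(4, 100, 384))
```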
“…This method has a drawback in that classifying emotions can be time-consuming, because the audio file must be analyzed and converted to audio without noise or silence during preprocessing. In the aforementioned studies [29, 30, 31, 32, 33], local correlations between spectral features could be ignored because normalized spectral features from preprocessing were used.…”
Section: Related Work (mentioning)
confidence: 99%
“…Variants of attention-based mechanisms have been proposed that performed significantly better than the previous models [18, 19, 16]. One possible reason why attention models outperform others is that they learn the biases for a specific task, or group of tasks, leading to improved generalisation.…”
Section: Related Work (mentioning)
confidence: 99%