2020
DOI: 10.1109/access.2020.3014733
Learning Salient Segments for Speech Emotion Recognition Using Attentive Temporal Pooling

Abstract: In the temporal process of expressing emotions, some intervals embed more salient emotional information than others. In this paper, by introducing an attentive temporal pooling module into a deep neural network (DNN) architecture, we present a simple but effective speech emotion recognition (SER) framework that automatically highlights the emotionally salient segments while suppressing the influence of less relevant ones. For an input speech utterance, the extracted feature sequence of hand-cra…
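The core idea in the abstract — scoring each frame's emotional relevance and pooling the sequence with those scores — can be illustrated with a minimal NumPy sketch. This is not the authors' exact architecture (their saliency weights involve a GMM and an additional DNN); here the attention vector `w` is a hypothetical learned parameter, and the pooling is a plain softmax-weighted sum over time:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_temporal_pooling(frames, w):
    """Pool a (T, D) frame-level feature sequence into one (D,) utterance vector.

    frames: (T, D) array of per-frame acoustic features
    w: (D,) hypothetical learned attention vector (stands in for the
       saliency model described in the paper)
    """
    scores = frames @ w        # (T,) scalar relevance score per frame
    alpha = softmax(scores)    # normalized saliency weights; sums to 1
    utterance = alpha @ frames # weighted sum over time: salient frames dominate
    return utterance, alpha

# Toy usage: 5 frames of 3-dimensional features
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3))
w = rng.normal(size=3)
utterance_vec, alpha = attentive_temporal_pooling(frames, w)
```

Because the weights `alpha` are produced end-to-end from the features, only an utterance-level emotion label is needed to train such a module — no frame- or segment-level annotation — which is the property the citing works below emphasize.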

Cited by 9 publications (4 citation statements). References 44 publications.
“…Xia et al. [36] proposed a DNN-based SER approach that captures temporal segment-level aspects of low-level features of voice signals, using low-level descriptors of the emotion signal related to energy, spectral, statistical, and voicing characteristics.…”
Section: Related Work
confidence: 99%
“…The Gaussian Mixture Model (GMM) and an additional DNN are used to extract emotional saliency weights from condensed representations. Notably, our methodology relies only on utterance-level labels yet achieves state-of-the-art SER performance on several public emotion datasets, such as RML, EMO-DB, and IEMOCAP, without requiring supervisory information at the frame or segment level [6].…”
Section: Related Work
confidence: 99%
“…It handles domain mismatch and data perturbations. Smoothing the adversarial model required a larger dataset [57]. The phase and loudness of speech reduce the frame-clipping effect in SER.…”
Section: SER Using Machine Learning Based Techniques
confidence: 99%