Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech

Neumann, Michael; Vu, Ngoc Thang

doi:10.21437/interspeech.2017-917

Cited by 193 publications

(176 citation statements)

References 34 publications

Supporting

Mentioning: 158

Contrasting


“…Figure 2 shows the confusion matrices of the proposed systems. In general, most of the emotion labels are frequently misclassified as neutral class, supporting the claims of [12,27]. The model confused between the excite and happy class since there exists a report of overlap in distinguishing these two classes even human evaluations [13].…”

Section: Performance Evaluation (supporting)

confidence: 58%

Attentive Modality Hopping Mechanism for Speech Emotion Recognition

Yoon, Dey, Lee et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In this work, we explore the impact of the visual modality, in addition to speech and text, on the accuracy of an emotion detection system. Traditional approaches tackle this task by fusing the knowledge from the various modalities independently to perform emotion classification. In contrast to these approaches, we introduce an attention mechanism to combine the information. We first apply a neural network to obtain hidden representations of the modalities. The attention mechanism is then defined to select and aggregate important parts of the video data by conditioning on the audio and text data. Furthermore, the attention mechanism is applied again to attend to important parts of the speech and textual data by considering the other modalities. Experiments are performed on the standard IEMOCAP dataset using all three modalities (audio, text, and video). The results show a significant improvement of 3.65% in weighted accuracy compared to the baseline system.
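The conditioning step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the summation of the audio and text vectors into a single query, and the names `attend`, `video_states`, `audio_vec`, and `text_vec` are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, query):
    """Dot-product attention (an assumed scoring function).

    Scores each hidden state against the query, normalizes the scores
    with softmax, and returns the weighted sum of the hidden states
    together with the attention weights.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


# Hypothetical toy data: three video time steps with 2-dimensional hidden states.
video_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio_vec = [2.0, 0.0]
text_vec = [1.0, 0.0]

# Assumption: the audio and text representations are summed to form the query.
query = [a + t for a, t in zip(audio_vec, text_vec)]
context, weights = attend(video_states, query)
```

Here the first video step receives the largest weight because it is most similar to the combined audio and text query. The same function could be reused with the roles swapped, for example attending over audio states with a query built from the other two modalities.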


“…To show the effectiveness of the proposed method, we compare our method with currently advanced approaches through the five-folder cross validation. Compared with our proposed method, these approaches [31,32] also utilized mel-scale spectrograms as inputs, and showed promising results for speech emotion recognition. Neumann et al [31] proposed an attentive CNN with multi-view learning objective function for speech emotion recognition.…”

Section: Comparison To Other Advanced Approaches (mentioning)

confidence: 99%

Unsupervised Representation Learning with Future Observation Prediction for Speech Emotion Recognition

Lian, Tao, Liu et al. 2019

Interspeech 2019


Prior works on speech emotion recognition utilize various unsupervised learning approaches to deal with low-resource samples. However, these methods pay less attention to modeling long-term dynamic dependencies, which are important for speech emotion recognition. To deal with this problem, this paper combines an unsupervised representation learning strategy, Future Observation Prediction (FOP), with transfer learning approaches (such as Fine-tuning and Hypercolumns). To verify the effectiveness of the proposed method, we conduct experiments on the IEMOCAP database. Experimental results demonstrate that our method is superior to currently advanced unsupervised learning strategies.


“…Motivated by the success of deep learning techniques in various application domains, such as large scale image and speech recognition [4,5], several Deep Neural Network (DNN) or Convolutional Neural Network (CNN) based SER methods have recently been proposed [6,7,8,9,10,11,12]. In [6,7], a multistage procedure was applied, in which the DNN and CNN network were trained for frontend feature extraction, followed by a backend emotion recognizer such as SVM and Extreme Learning Machine (ELM).…”

Section: Introduction (mentioning)

confidence: 99%

“…Neumann el. al [12] further introduced an attention mechanism after the max-pooling operation. while Mirsamadi et.…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, simply average-pooling or max-pooling may be insufficient to derive effective representations for complex emotional expressions that require analysis of higher order statistics. Some recent works show the benefit of introducing an attention mechanism for representation learning [12,13,10]. However, they generally derive salient regions from the features in a bottomup manner.…”

Section: Introduction (mentioning)

confidence: 99%


An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition

et al. 2018


This paper proposes an attention pooling based representation learning method for speech emotion recognition (SER). The emotional representation is learned in an end-to-end fashion by applying a deep convolutional neural network (CNN) directly to spectrograms extracted from speech utterances. Motivated by the success of GoogleNet, two groups of filters with different shapes are designed to capture both temporal and frequency domain context information from the input spectrogram. The learned features are concatenated and fed into the subsequent convolutional layers. To learn the final emotional representation, a novel attention pooling method is further proposed. Compared with the existing pooling methods, such as max-pooling and average-pooling, the proposed attention pooling can effectively incorporate class-agnostic bottom-up, and class-specific top-down, attention maps. We conduct extensive evaluations on benchmark IEMOCAP data to assess the effectiveness of the proposed representation. Results demonstrate a recognition performance of 71.8% weighted accuracy (WA) and 68% unweighted accuracy (UA) over four emotions, which outperforms the state-of-the-art method by about 3% absolute for WA and 4% for UA.
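The two metrics reported in this abstract can be computed as in the sketch below. It assumes the convention common in the speech emotion recognition literature, where weighted accuracy (WA) is the overall fraction of correctly classified utterances and unweighted accuracy (UA) is the mean of the per-class recalls; the abstract itself does not define them. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) under the assumed SER convention.

    WA: correct predictions divided by all utterances.
    UA: per-class recall averaged over the classes present in y_true.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    wa = sum(correct.values()) / len(y_true)
    ua = sum(correct[c] / total[c] for c in total) / len(total)
    return wa, ua


# Hypothetical, imbalanced toy labels: the neutral class dominates.
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad", "angry"]
y_pred = ["neutral"] * 6 + ["neutral", "happy"] + ["sad", "neutral"]

wa, ua = weighted_and_unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")  # prints "WA = 0.800, UA = 0.625"
```

On imbalanced data WA is dominated by the majority class, while UA penalizes poor recall on rare classes, which is why emotion recognition papers typically report both.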

