Interspeech 2019
DOI: 10.21437/interspeech.2019-1649

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Abstract: Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. This approach is limited, however, in that the resulting models may fail to capture temporal changes in the speech signal, including those indicative of a particular emotion. One potential solution to overcome this limitation is to model SER as a sequence-to-sequence task instead. In this regard, we have developed an attention-based bidirecti…
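The sequence-to-sequence framing described in the abstract relies on CTC's collapsing rule to map a frame-level label sequence onto a much shorter utterance-level one. A minimal sketch of that collapsing rule (not the paper's model; the emotion labels and blank symbol below are illustrative placeholders): consecutive repeated labels are merged, then blank frames are dropped.

```python
def ctc_collapse(frame_labels, blank="-"):
    """Apply the CTC collapsing rule: merge consecutive repeats,
    then drop blank symbols."""
    collapsed = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            collapsed.append(lab)
        prev = lab
    return collapsed

# A frame-level decoding such as ["-", "ang", "ang", "-", "ang"]
# collapses to ["ang", "ang"].
print(ctc_collapse(["-", "ang", "ang", "-", "ang"]))
```

Under this rule a long sequence of per-frame emotion hypotheses reduces to a handful of emotion segments, which is what makes a sequence-to-sequence formulation of a nominally sequence-to-label task tractable.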

Cited by 52 publications (39 citation statements)
References 22 publications
“…It is also worth noting that the data distribution of each emotion class is heavily imbalanced. Therefore, following the approach of [50, 51], we merged the happiness and excitement utterances into the happiness class. We used four categories of emotions—namely neutral, happiness, sadness, and anger—for training and evaluation.…”
Section: Experiments and Results
confidence: 99%
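The class-merging step quoted above amounts to a simple label remapping before training. A hedged sketch, assuming IEMOCAP-style string labels (the label names and the `surprise` example are illustrative, not taken from the cited work):

```python
from collections import Counter

# Merge excitement into happiness, as in the statement quoted above;
# keep only the four target emotion classes.
LABEL_MAP = {"excitement": "happiness"}
KEPT = {"neutral", "happiness", "sadness", "anger"}

def remap(labels):
    """Map raw labels through LABEL_MAP and discard any class
    outside the four kept categories."""
    merged = [LABEL_MAP.get(lab, lab) for lab in labels]
    return [lab for lab in merged if lab in KEPT]

raw = ["neutral", "excitement", "happiness", "anger", "surprise", "sadness"]
print(Counter(remap(raw)))
```

Counting the remapped labels also makes the class imbalance the statement mentions easy to inspect before choosing a training strategy.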
“…Attention mechanisms have been adopted in several works such as [24], [27], [47] and [22]. Different from our work, [24] investigated learning salient frames through an attentive CNN with multi-view learning objective function.…”
Section: Comparison With the State of the Art
confidence: 98%
“…parts for the whole utterance. In [47], two attention mechanisms were investigated to learn the emotionally relevant frames within the BLSTM-CTC framework: one is component attention, the other quantum attention.…”
Section: Comparison With the State of the Art
confidence: 99%
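The component attention mentioned in this statement weights each frame by its estimated emotional salience before pooling frames into an utterance-level representation. A minimal sketch of such softmax-weighted frame pooling (the scores and feature vectors are toy values, not the attention parameterization of [47]):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, scores):
    """Pool frame feature vectors into one utterance vector,
    weighting each frame by its softmax-normalized salience score."""
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames))
            for d in range(dim)]

# Three 2-dim frame vectors; the first frame is far more salient.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = [5.0, 0.0, 0.0]
print(attention_pool(frames, scores))
```

Because the weights sum to one, the pooled vector stays in the convex hull of the frame features, and a single emotionally salient frame can dominate the utterance representation.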
“…Since Google greatly improved the accuracy of machine translation [24], the attention mechanism has seen increasingly wide use in deep learning. In speech processing, attention mechanisms have been applied to many tasks, such as ASR [25], speaker recognition [26], and SER [2], [20], [21], [27], as in our work.…”
Section: Related Work
confidence: 99%