2022
DOI: 10.1109/access.2022.3163856

Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files

Citations: Cited by 57 publications (18 citation statements)
References: 25 publications
“…The convolutional layer‐based transformer is motivated by the success of the transformer and its variants in speech‐processing applications [65–67]. Instead of using a conventional transformer, this study uses convolution layers and multi‐head attention blocks to construct this module.…”
Section: Proposed SER System (mentioning, confidence: 99%)
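The excerpt above describes a transformer-like module built from convolution layers and multi-head attention blocks rather than a conventional transformer. The sketch below is only an illustration of that general pattern, assuming a PyTorch implementation with a 1-D convolution front-end and self-attention over the frame axis; the channel width, kernel size, and head count are placeholders, not values from the cited work.

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """One convolution + multi-head attention block (illustrative only;
    layer sizes and ordering are assumptions, not the cited paper's exact design)."""
    def __init__(self, channels=128, n_heads=4, kernel_size=3):
        super().__init__()
        # Convolution captures local spectral/temporal patterns in the feature sequence.
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):                                 # x: (batch, time, channels)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, time)
        x = self.norm1(x + y)                              # residual + norm around the convolution
        y, _ = self.attn(x, x, x)                          # multi-head self-attention over time
        return self.norm2(x + y)                           # residual + norm around the attention

# Example: a batch of 8 utterances, 200 frames, 128-dim features.
block = ConvAttentionBlock()
print(block(torch.randn(8, 200, 128)).shape)               # torch.Size([8, 200, 128])
```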
“…After training, this paper successfully constructs an audio-modal emotion-recognition model based on the "time-distributed CNNs + LSTMs" scheme and records the detailed parameters of each layer in the model. In the test phase, the performance of the model was evaluated using the RAVDESS dataset; six emotions were classified and predicted; and the "time-distributed CNNs + LSTMs" scheme was combined with the "SVM on global statistical features" [49] program and the "hybrid LSTM-transformer model" [50] in a comparative experiment. The specific effects are shown in Table 7 below.…”
Section: Training and Evaluation of Audio-Modal Emotion-Recognition M... (mentioning, confidence: 99%)
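The "time-distributed CNNs + LSTMs" scheme quoted above applies the same small CNN to each spectrogram chunk and models the resulting sequence of chunk embeddings with an LSTM. The following is a minimal sketch of that idea under assumed shapes (chunk size, feature width, six RAVDESS emotion classes); it is not the citing paper's exact network.

```python
import torch
import torch.nn as nn

class TimeDistributedCNNLSTM(nn.Module):
    """Minimal 'time-distributed CNNs + LSTMs' sketch: one shared CNN is applied to
    every spectrogram chunk, and an LSTM models the resulting sequence.
    All shapes and layer sizes here are illustrative assumptions."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.cnn = nn.Sequential(                      # shared across time steps
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                              # -> (batch*steps, 32)
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                              # x: (batch, steps, 1, mel, frames)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # apply the CNN per chunk
        _, (h, _) = self.lstm(feats)                      # last hidden state summarizes the clip
        return self.fc(h[-1])                             # logits over the six emotions

# Example: 4 clips, each split into 10 chunks of a 64x32 log-mel patch.
model = TimeDistributedCNNLSTM()
print(model(torch.randn(4, 10, 1, 64, 32)).shape)          # torch.Size([4, 6])
```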
“…In recent studies, transformer-based SER models have been proposed. Andayani et al. [36] proposed a hybrid model that replaced the position encodings of the transformer encoder with LSTM in order to learn contextualized long-term dependencies for emotion recognition. Pre-trained models and data augmentation techniques have also been used to improve SER performance in recent research.…”
Section: Related Work (mentioning, confidence: 99%)
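The description above (an LSTM standing in for the transformer encoder's positional encodings, so that order information comes from recurrence) can be sketched as follows. This is an assumed, simplified PyTorch rendering of the general idea, with illustrative layer sizes and MFCC-style inputs rather than Andayani et al.'s reported configuration.

```python
import torch
import torch.nn as nn

class HybridLSTMTransformer(nn.Module):
    """Sketch of the general idea: an LSTM replaces the transformer's positional
    encoding before the self-attention encoder. Layer counts and sizes are
    illustrative assumptions, not the paper's reported setup."""
    def __init__(self, n_features=40, d_model=128, n_heads=4, n_layers=2, n_classes=8):
        super().__init__()
        # LSTM injects sequence order, replacing sinusoidal/learned positional encodings.
        self.lstm = nn.LSTM(n_features, d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x):                      # x: (batch, frames, n_features), e.g. MFCCs
        x, _ = self.lstm(x)                    # contextualized, order-aware frame embeddings
        x = self.encoder(x)                    # multi-head self-attention over the frames
        return self.fc(x.mean(dim=1))          # mean-pool over time, then classify

# Example: 2 utterances, 300 frames of 40-dim MFCC features.
model = HybridLSTMTransformer()
print(model(torch.randn(2, 300, 40)).shape)    # torch.Size([2, 8])
```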