Leveraged Mel Spectrograms Using Harmonic and Percussive Components in Speech Emotion Recognition

Rudd, David Hason; Huo, Huan; Xu, Guandong

doi:10.1007/978-3-031-05936-0_31

Cited by 6 publications

(2 citation statements)

References 28 publications

“…In another work, Hason Rudd et al (2022) [10] first converted the speech signals into the mel spectrogram representation. Then, they applied a VGG16 model to extract feature maps with various dimensions and signal sampling ratios.…”

Section: Related Work (mentioning)

confidence: 99%

MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Ong, Lee, Lim et al. 2024

IEEE Access

Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.

“…Al-onazi et al [39] also proposed to augment a combination of first and second delta MFCCs, chroma grams, tonnetz, and spectral contrast which were used as input to a transformer-based SER model. In [40], harmonic and percussive components were extracted from the mel spectrograms and later concatenated with the mel spectrograms to form augmented input to a pre-trained VGG16 model for SER. They also experimented with a combination of MFCCs, mel spectrograms and chroma grams.…”

Section: Related Work (mentioning)

confidence: 99%

Attention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution

2022

The success of deep learning in speech emotion recognition has led to its application in resource-constrained devices. It has been applied in human-to-machine interaction applications like social living assistance, authentication, health monitoring and alertness systems. In order to ensure a good user experience, robust, accurate and computationally efficient deep learning models are necessary. Recurrent neural networks (RNN) like long short-term memory (LSTM), gated recurrent units (GRU) and their variants that operate sequentially are often used to learn time series sequences of the signal, analyze longterm dependencies and the contexts of the utterances in the speech signal. However, due to their sequential operation, they encounter problems in convergence and sluggish training that uses a lot of memory resources and encounters the vanishing gradient problem. In addition, they do not consider spatial cues that may exist in the speech signal. Therefore, we propose an attention-based multi-learning model (ABMD) that uses residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) layers with multi-head attention. The proposed ABMD model achieves comparable performance while taking global contextualized long-term dependencies between features in a parallel manner using a large receptive field with less increase in the number of parameters compared to the number of layers and considers spatial cues among the speech features. Spectral and voice quality features extracted from the raw speech signals are used as inputs. The proposed ABMD model obtained a recognition accuracy and F1 score of 93.75% and 92.50% on the SAVEE datasets, 85.89% and 85.34% on the RAVDESS datasets and 95.93% and 95.83% on the EMODB datasets. 
The model's robustness, measured by the confusion ratio of individual discrete emotions, also improved when validated on the same datasets, especially for happiness, which is often confused with emotions that lie in the same dimensional plane.

An Extended Variational Mode Decomposition Algorithm Developed Speech Emotion Recognition Performance

Rudd, Huo 2023

Lecture Notes in Computer Science

Emotion recognition (ER) from speech signals is a robust approach since it cannot be imitated like facial expression or text based sentiment analysis. Valuable information underlying the emotions are significant for human-computer interactions enabling intelligent machines to interact with sensitivity in the real world. Previous ER studies through speech signal processing have focused exclusively on associations between different signal mode decomposition methods and hidden informative features. However, improper decomposition parameter selections lead to informative signal component losses due to mode duplicating and mixing. In contrast, the current study proposes VGG-optiVMD, an empowered variational mode decomposition algorithm, to distinguish meaningful speech features and automatically select the number of decomposed modes and optimum balancing parameter for the data fidelity constraint by assessing their effects on the VGG16 flattening output layer. Various feature vectors were employed to train the VGG16 network on different databases and assess VGG-optiVMD reproducibility and reliability. One, two, and three-dimensional feature vectors were constructed by concatenating Mel-frequency cepstral coefficients, Chromagram, Mel spectrograms, Tonnetz diagrams, and spectral centroids. Results confirmed a synergistic relationship between the fine-tuning of the signal sample rate and decomposition parameters with classification accuracy, achieving state-of-the-art 96.09% accuracy in predicting seven emotions on the Berlin EMO-DB database.
