2023
DOI: 10.1109/access.2023.3321122

Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers

Kah Liang Ong,
Chin Poo Lee,
Heng Siong Lim
et al.

Abstract: Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in various applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed different approaches for speech emotion recognition, leveraging various classifiers and features. However, despite the advancements, existing methods in speech emotion recognition still have certain limitations. Some approaches rely on handcrafte…

Cited by 6 publications (3 citation statements)
References: 18 publications
“…These features serve as the … The comprehensive benchmark studies in Table 9, with their respective subsets of features, datasets, and accuracy scores, show a snapshot of the broader research landscape in SER. Multiple feature-based speech emotion recognition systems have been proposed using distinct machine learning models such as a voting classifier [19], [61], an attention-based multi-learning model (ABMD) [23], a 1D-CNN [26], and MViTv2 [60]. However, these multi-featured emotion recognition systems target a particular regional accent.…”
Section: A. Discussion
Citation type: mentioning, confidence: 99%
“…This technique visualizes the changes in frequency over time, aiding in understanding the complex structures within audio signals. Designed to mimic the characteristics of human hearing, the Mel scale processes frequencies akin to how humans perceive sound, sharing similarities with MFCC analysis but emphasizing the visual representation of audio signals [14], [15].…”
Section: B. Mel Spectrogram Analysis
Citation type: mentioning, confidence: 99%
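The quoted passage describes the Mel spectrogram as a time-frequency representation on a perceptually motivated frequency scale. A minimal sketch of how such a representation is commonly computed, assuming the librosa library; the file name "speech.wav" and the chosen FFT, hop, and Mel-band settings are illustrative, not values taken from the cited paper:

```python
# Minimal sketch: computing a log-Mel spectrogram from a speech clip.
# "speech.wav", n_fft, hop_length, and n_mels are illustrative choices.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)      # mono waveform at 16 kHz

# Short-time Fourier transform -> power spectrogram -> Mel filter bank
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128
)
log_mel = librosa.power_to_db(mel, ref=np.max)    # dB scale, as usually visualized

print(log_mel.shape)  # (n_mels, n_frames): Mel frequency bands over time
```

The resulting 2-D array is what is typically rendered as the "visual representation" the passage refers to, with time on one axis and Mel bands on the other.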
“…Signal intra-pulse sequences are transformed into gray-scaled STFT spectrograms, and a CNN network is designed to extract features and classify the spectrogram [25]. Kah Liang Ong proposes a speech emotion recognition method that combines the Mel spectrogram with the Short-Term Fourier Transform (Mel-STFT) and Improved Multiscale Vision Transformers (MViTv2) [26]. However, Markov transfer field images possess several advantages in comparison with time-frequency maps and GAF transformations.…”
Section: Introductionmentioning
confidence: 99%
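The Mel-STFT input mentioned in this statement is only summarized here. One plausible reading is that the Mel spectrogram and the STFT magnitude spectrogram are fused into an image-like input for a vision backbone such as MViTv2; the channel stacking, resizing, and normalization below are assumptions for illustration, not the authors' published pipeline:

```python
# Rough sketch: fusing a Mel spectrogram and an STFT magnitude spectrogram
# into a 3-channel image-like tensor for a vision backbone (e.g. MViTv2).
# The fusion strategy and sizes are assumptions, not the cited paper's exact method.
import librosa
import numpy as np
import torch
import torch.nn.functional as F

y, sr = librosa.load("speech.wav", sr=16000)        # "speech.wav" is a placeholder

stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=1024, hop_length=256)))
mel_db = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)
)

def to_image(spec, size=(224, 224)):
    """Normalize to [0, 1] and resize to the backbone's expected input size."""
    t = torch.tensor(spec, dtype=torch.float32)[None, None]   # (1, 1, H, W)
    t = (t - t.min()) / (t.max() - t.min() + 1e-8)
    return F.interpolate(t, size=size, mode="bilinear", align_corners=False)

# Stack the two representations (Mel repeated to fill three channels) -> (1, 3, 224, 224)
x = torch.cat([to_image(mel_db), to_image(stft_db), to_image(mel_db)], dim=1)
# x can now be fed to any ImageNet-style classifier, e.g. an MViTv2 backbone.
```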