Multi-Input Speech Emotion Recognition Model Using Mel Spectrogram and GeMAPS

Toyoshima, Itsuki; Okada, Yoshito; Ishimaru, Momoko; Uchiyama, Ryunosuke; Tada, Mayu

doi:10.3390/s23031743

Cited by 4 publications

(4 citation statements)

References 48 publications


“…The accurate and effective extraction of relevant characteristics, as well as the high correlation among these features, are critical elements that significantly affect the effectiveness of the emotion detection system. Contemporary SER approaches have been positively affected by the introduction of several innovative feature extraction methods [17, 18, 19, 20]. In one study [17], a deep neural network model for SER that could simultaneously learn both MelSpec and GeMAPS audio features was proposed.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Contemporary SER approaches have been positively affected by the introduction of several innovative feature extraction methods [17, 18, 19, 20]. In one study [17], a deep neural network model for SER that could simultaneously learn both MelSpec and GeMAPS audio features was proposed. The three components of the model are the learning of MelSpec in picture format, learning of GeMAPS in vector format, and combining the two to predict emotions.…”

Section: Literature Review (mentioning)

confidence: 99%
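The three-component structure described in the statement above (an image-format branch, a vector-format branch, and a fusion step) can be illustrated with a minimal sketch. This is not the cited authors' architecture: the pooling step, the single dense layer, the random weights, and every dimension below (`N_MELS`, `N_FRAMES`, `N_GEMAPS`, `HIDDEN`, `N_EMOTIONS`) are hypothetical stand-ins chosen only to show how two differently shaped inputs can be joined before classification.

```python
import math
import random

def image_branch(mel_spec):
    """Stand-in for the image-format branch: mean-pool each mel band over time."""
    return [sum(band) / len(band) for band in mel_spec]

def vector_branch(gemaps, weights):
    """Stand-in for the vector-format branch: one dense layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, gemaps))) for row in weights]

def fuse_and_classify(img_emb, vec_emb, head):
    """Concatenate both embeddings, apply a linear head, and return softmax probabilities."""
    fused = img_emb + vec_emb  # list concatenation = feature fusion
    logits = [sum(w * x for w, x in zip(row, fused)) for row in head]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy, arbitrary sizes (assumptions, not values from the cited paper).
N_MELS, N_FRAMES, N_GEMAPS, HIDDEN, N_EMOTIONS = 8, 20, 16, 4, 4

rng = random.Random(0)
mel_spec = [[rng.random() for _ in range(N_FRAMES)] for _ in range(N_MELS)]
gemaps = [rng.random() for _ in range(N_GEMAPS)]
vec_w = [[rng.uniform(-0.1, 0.1) for _ in range(N_GEMAPS)] for _ in range(HIDDEN)]
head_w = [[rng.uniform(-0.1, 0.1) for _ in range(N_MELS + HIDDEN)]
          for _ in range(N_EMOTIONS)]

probs = fuse_and_classify(image_branch(mel_spec), vector_branch(gemaps, vec_w), head_w)
print(probs)  # one probability per (hypothetical) emotion class
```

In a trained model each branch would be a learned network and the weights would come from optimization; the sketch only shows the data flow from two inputs to one prediction.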


Enhancing Speech Emotion Recognition Using Dual Feature Extraction Encoders

Pulatov, Oteniyazov, Makhmudov et al. 2023

Sensors


Understanding and identifying emotional cues in human speech is a crucial aspect of human–computer communication. The application of computer technology in dissecting and deciphering emotions, along with the extraction of relevant emotional characteristics from speech, forms a significant part of this process. The objective of this study was to architect an innovative framework for speech emotion recognition predicated on spectrograms and semantic feature transcribers, aiming to bolster performance precision by acknowledging the conspicuous inadequacies in extant methodologies and rectifying them. To procure invaluable attributes for speech detection, this investigation leveraged two divergent strategies. Primarily, a wholly convolutional neural network model was engaged to transcribe speech spectrograms. Subsequently, a cutting-edge Mel-frequency cepstral coefficient feature abstraction approach was adopted and integrated with Speech2Vec for semantic feature encoding. These dual forms of attributes underwent individual processing before they were channeled into a long short-term memory network and a comprehensive connected layer for supplementary representation. By doing so, we aimed to bolster the sophistication and efficacy of our speech emotion detection model, thereby enhancing its potential to accurately recognize and interpret emotion from human speech. The proposed mechanism underwent a rigorous evaluation process employing two distinct databases: RAVDESS and EMO-DB. The outcome displayed a predominant performance when juxtaposed with established models, registering an impressive accuracy of 94.8% on the RAVDESS dataset and a commendable 94.0% on the EMO-DB dataset. This superior performance underscores the efficacy of our innovative system in the realm of speech emotion recognition, as it outperforms current frameworks in accuracy metrics.


“…Feature extraction is a critical step in classification and recognition using deep learning algorithms. The Mel filter bank, Gammatone filter bank and Bark filter bank are used to extract the spectrograms of sound signals in speech recognition and classification studies based on sound signals [24][25][26][27]. In order to extract richer features, a method for extracting fusion spectrograms is proposed, as shown in Figure 2.…”

Section: Arc Sound Feature Extraction (mentioning)

confidence: 99%
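As background for the Mel filter bank mentioned in the statement above, the sketch below places filter center frequencies evenly on the mel scale using the common formula mel = 2595 · log10(1 + f / 700). It is an illustration only, not the citing paper's extraction code; the function names and the parameters `n_filters`, `f_min`, and `f_max` are hypothetical.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Return n_filters center frequencies (Hz) spaced evenly on the mel scale.

    n_filters + 2 points are laid out between f_min and f_max; the two
    end points are the outer filter edges, so only the interior points
    are returned as centers.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * i) for i in range(1, n_filters + 1)]

# Hypothetical settings: 10 filters between 0 Hz and 4000 Hz.
centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=4000.0)
print([round(c, 1) for c in centers])
```

Because the points are evenly spaced in mels, the gap between neighboring centers in Hz widens as frequency increases, which gives finer resolution at low frequencies.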

Penetration State Identification of Aluminum Alloy Cold Metal Transfer Based on Arc Sound Signals Using Multi-Spectrogram Fusion Inception Convolutional Neural Network

Yang, Guan, Yang et al. 2023

Electronics


The CMT welding process has been widely used for aluminum alloy welding. The weld’s penetration state is essential for evaluating the welding quality. Arc sound signals contain a wealth of information related to the penetration state of the weld. This paper studies the correlation between the frequency domain features of arc sound signals and the weld penetration state, as well as the correlation between Mel spectrograms, Gammatone spectrograms and Bark spectrograms and the weld penetration state. Arc sound features fused with multilingual spectrograms are constructed as inputs to a custom Inception CNN model that is optimized based on GoogleNet for CMT weld penetration state recognition. The experimental results show that the accuracy of the method proposed in this paper for identifying the fusion state of CMT welds in aluminum alloy plates is 97.7%, which is higher than the identification accuracy of a single spectrogram as the input. The recognition accuracy of the customized Inception CNN is improved by 0.93% over the recognition accuracy of GoogleNet. The customized Inception CNN also has high recognition results compared to AlexNet and ResNet.


“…As a result, this study focuses on detecting double-compressed (DC) AMR speech signals. The magnitude of the discrete Fourier transform (DFT) of short speech segments, commonly known as the spectrogram representation of speech signals, has found widespread application in various tasks such as speaker recognition [6], speech recognition [7], emotion recognition [8], and audio event detection [9]. Its effectiveness stems from its ability to capture the spectral content variation of the signal over time, making it suitable for use with deep neural networks (DNNs), such as deep convolutional neural networks (CNNs), and long-short-term memory (LSTM) networks.…”

Section: Introduction (mentioning)

confidence: 99%
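The statement above defines the spectrogram as the magnitude of the DFT of short speech segments. A minimal sketch of that definition follows, using a direct DFT on a synthetic tone. The function name, the Hann window, and the values of `frame_len`, `hop`, `sample_rate`, and `tone_hz` are assumptions for illustration; practical code would use an FFT library rather than this direct sum.

```python
import cmath
import math

def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of |DFT| values for bins 0..frame_len // 2."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT sum for bin k.
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))  # keep magnitude, discard phase
        frames.append(spectrum)
    return frames

# Synthetic test signal: a pure tone (hypothetical values).
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = magnitude_spectrogram(signal)
# Bin k corresponds to k * sample_rate / frame_len Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin * sample_rate / 64)
```

Each row of `spec` describes the spectral content of one short segment, so stacking the rows shows how that content varies over time. The phase discarded by `abs` is the quantity the citing paper proposes to use as an additional feature.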

Exploring the Effectiveness of the Phase Features on Double Compressed AMR Speech Detection

Büker, Hanilçi 2024

Applied Sciences


Determining whether an audio signal is single compressed (SC) or double compressed (DC) is a crucial task in audio forensics, as it is closely linked to the integrity of the recording. In this paper, we propose the utilization of phase spectrum-based features for detecting DC narrowband and wideband adaptive multi-rate (AMR-NB and AMR-WB) speech. To the best of our knowledge, phase spectrum features have not been previously explored for DC audio detection. In addition to introducing phase spectrum features, we propose a novel parallel LSTM system that simultaneously learns the most representative features from both the magnitude and phase spectrum of the speech signal and integrates both sets of information to further enhance its performance. Analyses demonstrate significant differences between the phase spectra of SC and DC speech signals, suggesting their potential as representative features for DC AMR speech detection. The proposed phase spectrum features are found to perform as well as magnitude spectrum features for the AMR-NB codec, while outperforming the magnitude spectrum in detecting AMR-WB speech. The proposed phase spectrum features yield 8% performance improvement in terms of true positive rate over the magnitude spectrogram features. The proposed parallel LSTM system further improves DC AMR-WB speech detection.
