Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Tursunov, Anvarjon; Mustaqeem; Kwon, Soonil

doi:10.3390/s20185212

Cited by 110 publications

(40 citation statements)

References 53 publications


“…In contrary to that, there are few articles testing and analyzing the behavior of specific features and their settings in given conditions, e.g., testing frequency ranges or scales [44], etc. Nevertheless, published results measured even on the same database vary a lot, e.g., from approximately 50% [45] to even 92% [46], mainly due to the experimental set up, evaluation, processing, and classification. This study differs as it provides a unified, complex, and statistically rigorous analysis of great variety of basic speech properties, features and their settings, and calculation methods related to SER, by means of the machine learning.…”

Section: Discussion (mentioning)

confidence: 99%

On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition

Kačur, Puterka, Pavlovičová, et al. 2021. Sensors


Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, to what extent, etc. This study is to extend physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification regions—lengths and overlaps), frequency ranges, frequency scales, processing of whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling), and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase the state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross validation, paired t-test, rank, and Pearson correlations. The results revealed several settings in a 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Well scoring are also spectrograms carrying vocal tract and excitation information. It was found that even basic processing like pre-emphasis, segmentation, magnitude modifications, etc., can dramatically affect the results. Most findings are robust by exhibiting strong correlations across tested databases.
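The evaluation described in this abstract (N-fold cross validation followed by a paired t-test) can be illustrated with a minimal sketch. It assumes two feature settings were scored on the same folds; the function name `paired_t_statistic` and the per-fold accuracies are hypothetical and do not come from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees_of_freedom) for paired per-fold accuracies."""
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 folds")
    # Pair the scores fold by fold: both settings saw the same test data.
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold accuracies, for illustration only.
setting_a = [0.74, 0.76, 0.75, 0.77, 0.73]
setting_b = [0.70, 0.73, 0.72, 0.72, 0.71]
t, dof = paired_t_statistic(setting_a, setting_b)
print(f"t = {t:.3f}, dof = {dof}")
```

The statistic is then compared against a Student t distribution with the returned degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` performs the same test and also reports a p-value.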


“…The method presented in this study consists of acoustic features, deep features, pre-trained CNN and SVM combined model. In many studies, acoustic and deep features are used separately [11], [12], [16], [17]. In this study, acoustic and deep features are combined to improve the semantic information of the emotion features in the speech.…”

Section: Proposed Methods (mentioning)

confidence: 99%

“…Generally, the essential feature parameters utilized in the speech emotion recognition system can be separated into two categories in terms of conventional features and deep features. Features extracted from Convolutional Neural Network (CNN) layers are generally used as deep features [11], [12]. In [13], to recognize emotions from speech, a method that is based on MFCC features and Gaussian mixture model classifier is proposed.…”

Section: Related Work (mentioning)

confidence: 99%

A Novel Approach for Classification of Speech Emotions Based on Deep and Acoustic Features

2020. IEEE Access


“…Recent SER models based on deep-learning architectures [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] have demonstrated state-of-the-art performance with an attention mechanism [19, 20, 22, 23, 25, 26]. The deep-learning architectures adopted in previous studies included recurrent neural networks (RNN) [19], convolutional neural networks (CNN) [24], and convolutional RNNs (CRNN) [20, 26]. Liu et al [21] presented an SER model of a decision tree for an extreme learning machine having a single hidden-layer feed-forward neural network, using a mixture of deep learning and typical classification techniques.…”

Section: Related Work (mentioning)

confidence: 99%

Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Noh, Jeong, Lim, et al. 2021. Sensors


Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
