“…In [8], the authors proposed a new speech feature combined with an SVM classifier and evaluated it using the EMODB and CASIA databases. In [39], the authors proposed feature extraction in both vowel and non-vowel regions with an extreme learning machine (ELM), which they evaluated on the EMODB and IEMOCAP databases. In [40], the authors proposed a new speech feature combined with an acoustic mask and a likelihood classifier, and they evaluated it using the EMODB database.…”
Section: Confusion Matrix In Three Databases
Many works have focused on speech emotion recognition algorithms; however, most rely on the proper selection of speech acoustic features. In this paper, we propose a novel emotion recognition algorithm that does not rely on any hand-selected speech acoustic features and that incorporates speaker gender information. We aim to benefit from the rich information in raw speech data, without any artificial intervention. In general, speech emotion recognition systems require manual selection of appropriate traditional acoustic features as classifier input. With a deep learning approach, the network automatically selects the important information in the raw speech signal for the classification layer to accomplish emotion recognition, which prevents the omission of emotional information that cannot be directly modeled mathematically as a speech acoustic characteristic. We also add speaker gender information to the proposed algorithm to further improve recognition accuracy. The proposed algorithm combines a Residual Convolutional Neural Network (R-CNN) with a gender information block; the raw speech data is sent to these two blocks simultaneously. The R-CNN obtains the necessary emotional information from the speech data and classifies the emotion category. The proposed algorithm is evaluated on three public databases covering different languages. Experimental results show accuracy improvements of 5.6%, 7.3%, and 1.5% on the Mandarin, English, and German databases, respectively, compared with the existing highest-accuracy algorithms. To verify the generalization of the proposed algorithm, we also use the FAU and eNTERFACE databases; on these two independent databases, the proposed algorithm achieves 85.8% and 71.1% accuracy, respectively.
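The abstract gives the overall structure (an R-CNN over raw speech plus a gender information block feeding a shared classification layer) but not the exact configuration. The following is a minimal sketch of that idea in PyTorch; the channel counts, kernel sizes, concatenation-based fusion, and the shared convolutional stem feeding the gender branch are all illustrative assumptions, not the paper's published design.

```python
# Hedged sketch: residual 1-D CNN over raw speech + gender branch -> classifier.
# All layer sizes and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class ResBlock1d(nn.Module):
    """Residual block operating on a 1-D (raw-waveform) feature map."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=9, padding=4)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=9, padding=4)
        self.bn1 = nn.BatchNorm1d(channels)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection


class GenderAwareEmotionNet(nn.Module):
    """Raw waveform -> emotion features + gender estimate -> emotion logits."""

    def __init__(self, n_emotions=6):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.BatchNorm1d(32), nn.ReLU()
        )
        self.res = nn.Sequential(ResBlock1d(32), ResBlock1d(32))
        self.pool = nn.AdaptiveAvgPool1d(1)           # global pooling over time
        self.gender_branch = nn.Sequential(            # outputs P(male), P(female)
            nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=1)
        )
        self.classifier = nn.Linear(32 + 2, n_emotions)

    def forward(self, wave):                           # wave: (batch, 1, samples)
        h = self.pool(self.res(self.stem(wave))).squeeze(-1)   # (batch, 32)
        g = self.gender_branch(h)                               # (batch, 2)
        return self.classifier(torch.cat([h, g], dim=1))        # (batch, n_emotions)


# Example: a batch of 2-second raw waveforms at 16 kHz.
logits = GenderAwareEmotionNet()(torch.randn(4, 1, 32000))
print(logits.shape)  # torch.Size([4, 6])
```

In this sketch the gender branch reads the same pooled embedding as the emotion classifier for brevity; the paper instead routes the raw speech to both blocks in parallel.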
“…Two recent works using the audio modality can be found in [53] and [54]. Deb & Dandapat in 2017 proposed a method for speech emotion classification using vowel-like regions (VLRs) and non-vowel-like regions (non-VLRs).…”
The exponential growth of multimodal content in today's competitive business environment leads to a huge volume of unstructured data. Unstructured big data has no particular format or structure and can be in any form, such as text, audio, images, and video. In this paper, we address the challenges of emotion and sentiment modeling that arise from unstructured big data with different modalities. We first provide an up-to-date review of emotion and sentiment modeling, including state-of-the-art techniques. We then propose a new architecture for multimodal emotion and sentiment modeling for big data. The proposed architecture consists of five essential modules: a data collection module, a multimodal data aggregation module, a multimodal data feature extraction module, a fusion and decision module, and an application module. Novel feature extraction techniques called divide-and-conquer principal component analysis (Div-ConPCA) and divide-and-conquer linear discriminant analysis (Div-ConLDA) are proposed for the multimodal data feature extraction module of the architecture. Experiments on a multicore machine architecture are performed to validate the performance of the proposed techniques.

INDEX TERMS: Big data, affective analytics, emotion recognition, sentiment modeling, unstructured data.
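The excerpt does not define Div-ConPCA, so the following is only one plausible reading of a divide-and-conquer PCA: split the wide feature matrix into column blocks, fit PCA on each block independently (these independent fits are what a multicore machine can run in parallel), and concatenate the per-block projections. The block count and component count are arbitrary illustrative values.

```python
# Hedged sketch of a divide-and-conquer PCA; not the paper's exact Div-ConPCA.
import numpy as np
from sklearn.decomposition import PCA


def div_con_pca(X, n_blocks=4, n_components=8):
    """Fit PCA per column block and return the concatenated projections."""
    blocks = np.array_split(X, n_blocks, axis=1)        # divide step
    projections = [PCA(n_components=n_components).fit_transform(b) for b in blocks]
    return np.hstack(projections)                        # merge step


# Example: 500 samples with 1024 unstructured features each.
X = np.random.randn(500, 1024)
Z = div_con_pca(X)
print(Z.shape)  # (500, 32) -> 4 blocks x 8 components each
```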
“…The MFCC is a widely used spectral feature for speech emotion recognition [26]; it comprises the MFCCs, delta MFCCs, and delta-delta MFCCs, for a total of 39 coefficients.…”
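The 39-dimensional feature referred to above is the standard stack of 13 static MFCCs plus their first- and second-order differences. A minimal illustration with librosa follows; the signal, frame, and hop settings are generic defaults, not values from the cited work.

```python
# Illustration of the 39-coefficient MFCC feature: 13 MFCCs + delta + delta-delta.
import numpy as np
import librosa

# Any mono speech signal works; a synthetic 1-second tone stands in for real data.
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, frames)
delta = librosa.feature.delta(mfcc)                    # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)          # second-order differences

features = np.vstack([mfcc, delta, delta2])            # (39, frames)
print(features.shape[0])  # 39 coefficients per frame
```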
Summary
Speech emotion recognition is an important technique for human-computer interface applications. Because it carries rich emotional information, the spectral feature is widely used for emotion recognition. However, recognition performance is limited by imprecise feature extraction rules and the uncertain resolution of spectral features. To address this issue, motivated by speech coding, we introduce a psychoacoustic model and provide a perceptual spectral subband partition method that yields a more precise frequency resolution. Moreover, we provide a new spectral feature computed on the divided subband signals. The proposed feature includes emotional perception entropy, spectral inclination, and spectral flatness. A Support Vector Machine classifier is then used to recognize the emotion categories. The experimental results show that the proposed spectral feature is superior to the traditional MFCC feature, and also better than the state-of-the-art Fourier feature and multi-resolution amplitude feature.
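A rough sketch of the subband-feature-then-SVM pipeline described above is given below. The spectral flatness follows the standard geometric-to-arithmetic-mean definition; the "perception entropy" here is an ordinary spectral entropy and the subband partition is a plain linear split, both stand-ins for the paper's psychoacoustic definitions, and the training data is synthetic.

```python
# Hedged sketch of a subband spectral-feature extractor feeding an SVM classifier.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def subband_features(y, sr, n_bands=4):
    """Per-subband spectral entropy and flatness, averaged over frames."""
    S = np.abs(librosa.stft(y, n_fft=512)) ** 2              # power spectrogram
    feats = []
    for band in np.array_split(S, n_bands, axis=0):           # crude linear subbands
        p = band / (band.sum(axis=0, keepdims=True) + 1e-12)
        entropy = -(p * np.log2(p + 1e-12)).sum(axis=0)        # spectral entropy
        flatness = np.exp(np.mean(np.log(band + 1e-12), axis=0)) / (band.mean(axis=0) + 1e-12)
        feats += [entropy.mean(), flatness.mean()]
    return np.array(feats)


# Toy training data: two synthetic "emotion" classes with different spectra.
sr = 16000
t = np.arange(sr) / sr
X = np.array([subband_features(np.sin(2 * np.pi * f * t) + 0.05 * np.random.randn(sr), sr)
              for f in [150, 160, 170, 400, 420, 440]])
labels = np.array([0, 0, 0, 1, 1, 1])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)
print(clf.predict(X))
```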