Abstract: In this paper, a novel multiscale amplitude feature is proposed using multiresolution analysis (MRA), and the significance of the vocal tract is investigated for emotion classification from the speech signal. MRA decomposes the speech signal into a number of sub-band signals. The proposed feature is computed by applying a sinusoidal model to each sub-band signal. Different emotions have different impacts on the vocal tract; as a result, the vocal tract responds in a unique way to each emotion. The vocal tract information…
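The pipeline the abstract describes, MRA sub-band decomposition followed by a per-band amplitude feature, can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: it uses a Haar wavelet-packet tree for the decomposition and keeps the largest FFT magnitudes per band as a crude stand-in for a fitted sinusoidal model; the depth and peak count are arbitrary choices.

```python
import numpy as np

def haar_split(x):
    """One level of a Haar analysis filter bank: low- and high-band halves."""
    x = x[: len(x) // 2 * 2]                     # force even length
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail

def wavelet_packet(x, depth):
    """Full wavelet-packet tree: 2**depth sub-band signals."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        bands = [half for b in bands for half in haar_split(b)]
    return bands

def multiscale_amplitude(x, depth=3, n_peaks=2):
    """Per sub-band, keep the n_peaks largest FFT magnitudes
    (a rough proxy for sinusoidal-model amplitude parameters)."""
    feats = []
    for band in wavelet_packet(x, depth):
        mag = np.abs(np.fft.rfft(band))
        feats.extend(np.sort(mag)[-n_peaks:][::-1])
    return np.array(feats)
```

With depth 3 the signal is split into 8 sub-bands, giving a 16-dimensional feature vector for `n_peaks=2`.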
“…Current methods of emotion recognition mainly involve facial expression recognition [3][4][5][6], speech emotion recognition [7][8][9], gesture expression recognition [10], text recognition [11], physiological pattern recognition, and multimodal emotion recognition [12][13][14][15]. In practical applications, the non-contact method of extracting physiological parameters for face imaging has attracted special attention.…”
An emotion recognition method based on multispectral imaging technology and tissue oxygen saturation (StO2) is proposed in this study. This method is called the spatial-spectral-temporal adjustment convolutional neural network (SACNN). First, we extract the StO2 content of the emotionally sensitive nose area through real-time multispectral imaging. Compared with facial expression data, StO2 data are more objective and cannot be artificially controlled or altered. Second, we construct a clustering algorithm based on the emotional state by extracting the spectral, StO2, and spatial features of the nose image to obtain accurate signals from emotionally sensitive areas. To exploit the correlation between spectral and spatial signals, we propose an adjustment-based CNN module, which reorganizes the relationships among all previous layers of the feature map, thereby making the relationships among layers close and highly quantitative. The features extracted through this method are consistent with the spatial-spectral features. Third, we feed the extracted temporal feature signal into a long short-term memory module, finally completing the correlation across the spatial-spectral-temporal features. Experimental results show that the SACNN algorithm reaches 90% accuracy in emotion recognition, and the proposed method is more competitive than state-of-the-art approaches. To the best of our knowledge, this study is the first to use time-series StO2 signals for emotion recognition. INDEX TERMS: Multispectral imaging, oxygen saturation, spatial-spectral-temporal adjustment convolutional neural network.
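The StO2 step rests on a standard idea: tissue absorbance at two (or more) wavelengths can be inverted for relative oxy- and deoxy-hemoglobin concentrations via a modified Beer-Lambert model. The sketch below illustrates that inversion only; the wavelengths and extinction coefficients are illustrative placeholders, not values from the cited study, and path-length/scattering terms are folded into the relative concentrations.

```python
import numpy as np

# Illustrative extinction coefficients for [HbO2, Hb] at two example
# wavelengths (660 nm, 940 nm); real values come from published tables.
EXT = np.array([[320.0, 3227.0],    # 660 nm
                [1214.0, 693.0]])   # 940 nm

def sto2_from_reflectance(r1, r2):
    """Estimate per-pixel tissue oxygen saturation from reflectance images
    at two wavelengths via a modified Beer-Lambert inversion."""
    absorb = np.stack([-np.log(np.clip(r1, 1e-6, None)),
                       -np.log(np.clip(r2, 1e-6, None))])      # (2, H, W)
    # Solve EXT @ [c_HbO2, c_Hb] = absorbance at every pixel at once.
    conc = np.linalg.solve(EXT, absorb.reshape(2, -1))          # (2, H*W)
    c_oxy, c_deoxy = np.clip(conc, 0.0, None)
    return (c_oxy / (c_oxy + c_deoxy + 1e-12)).reshape(r1.shape)
```

The per-pixel StO2 maps produced this way would then form the time-series input to the clustering and CNN-LSTM stages described above.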
“…It is evident from the literature that the combination of speech features, i.e. feature fusion, increases the classification accuracy of the SER system [6,23,28] and hence became the most common practice in this field.…”
Section: Continuous Features (mentioning)
confidence: 99%
“…Mel-frequency cepstral coefficients (MFCCs) [11,21,22], linear prediction coefficients (LPCs) [23], relative spectral perceptual linear prediction (RASTA-PLP) [16], and variants of these features like modified MFCC (M-MFCC) [13], and feature fusion of MFCC and short-time energy features with velocity (Δ) and acceleration (ΔΔ) [23] are some of the well-known spectral features that are used for speech emotion recognition. Apart from these, log frequency power coefficients (LFPCs) [24], Fourier parameter features [25], time-frequency features with AMS-GMM mask [26], modulation spectral features [27], and amplitude-based features [28] are some of the variants of spectral features that are now used in SER analysis.…”
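The most widely used of these features, the MFCC, can be sketched for a single windowed frame with plain NumPy: power spectrum, mel-spaced triangular filter bank, log compression, then a DCT-II. The filter-bank size, FFT length, and coefficient count here are illustrative defaults, not those of any cited paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs of one frame: power spectrum -> mel filter bank -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2 / n_fft
    logmel = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))  # DCT-II basis
    return dct @ logmel
```

The Δ and ΔΔ features mentioned in the quote are then first and second finite differences of these coefficients across consecutive frames.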
In recent times, much research has been progressing in the field of speech emotion recognition (SER). Many SER systems have been developed by combining different speech features to improve their performance. As a result, the classifier becomes more complex in order to train on this large feature set. Additionally, some of the features may be irrelevant to emotion detection, which decreases the emotion recognition accuracy. To overcome this drawback, feature optimization can be performed on the feature sets to obtain the most discriminative emotional feature set before classification. In this paper, semi-nonnegative matrix factorization (semi-NMF) with singular value decomposition (SVD) initialization is used to optimize the speech features. The speech features considered in this work are mel-frequency cepstral coefficients, linear prediction cepstral coefficients, and Teager energy operator-autocorrelation (TEO-AutoCorr). This work uses k-nearest neighbor and support vector machine (SVM) classifiers for the classification of emotions with a 5-fold cross-validation scheme. The datasets considered for the performance analysis are EMO-DB and IEMOCAP. The performance of the proposed SER system using semi-NMF is validated in terms of classification accuracy. The results emphasize that the accuracy of the proposed SER system improves remarkably upon using the semi-NMF algorithm for optimizing the feature sets, compared to the baseline SER system without optimization.
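The core optimization step named here can be sketched as follows. This is a generic semi-NMF (X ≈ F Gᵀ with G ≥ 0, F unconstrained) with the standard closed-form F step and multiplicative G update, seeded from a truncated SVD as one simple way of implementing SVD initialization; the paper's exact variant and rank choices may differ.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9):
    """Semi-NMF: X ~ F @ G.T with G >= 0, F unconstrained.
    F has a closed-form update; G uses the multiplicative rule."""
    # SVD initialization: signed left factors, rectified right factors.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    F = U[:, :k] * s[:k]
    G = np.abs(Vt[:k].T) + eps

    pos = lambda A: (np.abs(A) + A) / 2.0    # elementwise positive part
    neg = lambda A: (np.abs(A) - A) / 2.0    # elementwise negative part
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)  # closed-form F given G
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                     (neg(XtF) + G @ pos(FtF) + eps))
    return F, G
```

Applied to a feature matrix (utterances × stacked MFCC/LPCC/TEO-AutoCorr features), the nonnegative factor G provides the reduced representation passed to the k-NN or SVM classifier.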
“…The multiscale amplitude feature (abbreviated Mul‐Amp) is a recently proposed (2018) multi‐resolution feature in the time domain 14 . The multi‐resolution structure is obtained through a wavelet packet transform and sub‐band partition.…”
Section: Experiments and Evaluation (mentioning)
confidence: 99%
“…Deb and Dandapat extracted a sub‐band amplitude feature by decomposing the speech signal into multi‐scale frequency bands and applying the Fourier transform. This feature showed good discriminative performance in experiments 14 . However, it partitions the frequency bands uniformly and therefore cannot capture the non‐linear frequency resolution required by the psychoacoustic model.…”
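The uniform-versus-perceptual partition contrast drawn here is easy to make concrete: band edges equally spaced in Hz versus edges equally spaced on a psychoacoustic scale. The sketch below uses the Traunmüller/Zwicker-style Bark approximation as one common choice; the band count and frequency range are illustrative.

```python
import numpy as np

def bark(f_hz):
    """One common Hz -> Bark approximation (Zwicker/Traunmüller style)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def bark_band_edges(n_bands, f_max=8000.0):
    """Band edges equally spaced in Bark, mapped back to Hz by interpolation."""
    f = np.linspace(0.0, f_max, 4096)
    z = bark(f)
    targets = np.linspace(0.0, z[-1], n_bands + 1)
    return np.interp(targets, z, f)

uniform = np.linspace(0.0, 8000.0, 9)    # uniform 8-band partition (Hz)
perceptual = bark_band_edges(8)          # Bark-spaced 8-band partition (Hz)
```

The perceptual partition allocates much narrower bands at low frequencies (its first edge falls near a few hundred Hz instead of 1 kHz), which is exactly the non-linear resolution the quoted criticism says a uniform partition misses.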
Summary
Speech emotion recognition is an important technique for human‐computer interface applications. Because it carries rich emotional information, the spectral feature is widely used for emotion recognition. However, recognition performance is limited by imprecise extraction rules and the uncertain frequency resolution of spectral features. To address this issue, motivated by speech coding, we introduce a psychoacoustic model and provide a perceptual spectral sub‐band partition method that yields a more precise frequency resolution. Moreover, we also propose a new spectral feature computed on the divided sub‐band frequency signals. The proposed feature comprises emotional perception entropy, spectral inclination, and spectral flatness. A Support Vector Machine classifier is then used to recognize the emotion categories. The experimental results show that the proposed spectral feature is superior to the traditional MFCC feature, and also outperforms the state‐of‐the‐art Fourier feature and the multi‐resolution amplitude feature.