Speech is an effective way of communicating and exchanging complex information between humans. Speech signals have attracted considerable attention in human-computer interaction, and emotion recognition from speech has therefore become an active research topic in human-machine interaction. In this paper, we propose a novel speech emotion recognition (SER) system that represents speech signals as multivariate time series of handcrafted features. A bidirectional echo state network (ESN) with two parallel reservoir layers is applied to capture additional independent information. The parallel reservoirs produce multiple representations for each direction of the bidirectional data, which are combined in two stages of concatenation. Sparse random projection is used to reduce the high-dimensional sparse output of both reservoirs, separately for each direction. Random over-sampling and random under-sampling are used to compensate for the class imbalance in the speech emotion datasets. The proposed parallel ESN model is evaluated in speaker-independent experiments on the EMO-DB, SAVEE, RAVDESS, and FAU Aibo datasets. The results show that the proposed SER model outperforms both the single-reservoir model and state-of-the-art studies.
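The pipeline described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the reservoir sizes, leak rate, spectral radius, the use of only the final reservoir state, and the order of the two concatenation stages are all assumptions made for illustration, and every function name below is hypothetical. The projection matrix follows the commonly used very sparse random projection construction.

```python
import numpy as np


def make_reservoir(n_in, n_res, rng, spectral_radius=0.9, density=0.1):
    """Random input and recurrent weights for one reservoir (assumed settings)."""
    w_in = rng.uniform(-1.0, 1.0, size=(n_res, n_in))
    w = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
    w *= rng.random((n_res, n_res)) < density  # keep only a sparse set of links
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    if radius > 0:
        w *= spectral_radius / radius  # rescale to the target spectral radius
    return w_in, w


def run_reservoir(x, w_in, w, leak=0.3):
    """Drive a leaky-tanh reservoir with x of shape (T, n_in); return the last state."""
    state = np.zeros(w.shape[0])
    for frame in x:
        state = (1.0 - leak) * state + leak * np.tanh(w_in @ frame + w @ state)
    return state


def sparse_projection_matrix(n_features, n_components, rng):
    """Very sparse random projection matrix with density 1/sqrt(n_features)."""
    density = 1.0 / np.sqrt(n_features)
    signs = rng.choice(
        [-1.0, 0.0, 1.0],
        size=(n_features, n_components),
        p=[density / 2, 1.0 - density, density / 2],
    )
    return signs * np.sqrt(1.0 / (density * n_components))


def bidirectional_parallel_features(x, reservoirs, projections):
    """Combine parallel reservoirs over forward and time-reversed input."""
    parts = []
    for name, seq in (("fwd", x), ("bwd", x[::-1])):
        # Stage 1 (assumed): concatenate the states of the parallel reservoirs.
        states = np.concatenate([run_reservoir(seq, wi, w) for wi, w in reservoirs])
        # Reduce each direction separately with its own projection matrix.
        parts.append(states @ projections[name])
    # Stage 2 (assumed): concatenate the two reduced directions.
    return np.concatenate(parts)


rng = np.random.default_rng(0)
n_in, n_res, n_comp = 12, 50, 16
reservoirs = [make_reservoir(n_in, n_res, rng) for _ in range(2)]
projections = {
    d: sparse_projection_matrix(2 * n_res, n_comp, rng) for d in ("fwd", "bwd")
}
x = rng.standard_normal((40, n_in))  # stand-in for a handcrafted feature sequence
features = bidirectional_parallel_features(x, reservoirs, projections)
print(features.shape)  # (32,)
```

The resulting fixed-length vector would then be passed to a trained readout (for example a linear classifier), which is omitted here.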
Speech is an effective, quick, and important way of communicating and exchanging complex information between humans. Emotions have always been part of normal human conversation and make speech more engaging. Because both speech and emotion play such a major role, many researchers have been drawn to Speech Emotion Recognition (SER), which still poses many challenges. In this study, we propose a novel reservoir computing approach in which the random input connection weights are initialized from a truncated normal distribution. Furthermore, Population-Based Training (PBT) is adopted to optimize the hyperparameters of the whole Echo State Network (ESN) model, which have a significant impact on model performance. The proposed model uses bidirectional reservoir input to increase its memorization capability, and Sparse Random Projection (SRP) is applied for dimensionality reduction as a simple, unsupervised, low-complexity approach. A speaker-independent strategy was employed on the EMODB and SAVEE datasets as acted speech emotion datasets and on Aibo as a non-acted dataset. The model achieved unweighted average recalls of 84.8%, 65.95%, and 45.99% on the EMODB, SAVEE, and Aibo datasets, respectively. The results show that the proposed model outperforms recent state-of-the-art studies at a lower computational cost.
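Two ingredients named in this abstract are easy to make concrete: truncated normal initialization of the input weights and the unweighted average recall (UAR) metric, which is the mean of the per-class recalls. The sketch below is illustrative only. The truncation bound of two standard deviations and the weight scale are assumptions, since the abstract does not state them, and both function names are hypothetical.

```python
import numpy as np


def truncated_normal(shape, rng, std=1.0, bound=2.0):
    """Draw N(0, std^2) samples, redrawing any value outside +/- bound * std."""
    out = rng.normal(0.0, std, size=shape)
    mask = np.abs(out) > bound * std
    while mask.any():
        out[mask] = rng.normal(0.0, std, size=int(mask.sum()))
        mask = np.abs(out) > bound * std
    return out


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


rng = np.random.default_rng(0)
w_in = truncated_normal((50, 12), rng, std=0.5)  # assumed reservoir and input sizes

# Imbalanced toy labels: class 0 recall is 4/4, class 1 recall is 1/2.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
print(unweighted_average_recall(y_true, y_pred))  # 0.75
```

On this toy example plain accuracy is 5/6, while UAR is 0.75, which is why UAR is the usual choice for imbalanced emotion datasets.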