Direct Modelling of Speech Emotion from Raw Speech

Latif, Siddique; Rana, Rajib; Khalifa, Sara; Jurdak, Raja; Epps, Julien

doi:10.21437/interspeech.2019-3252

Cited by 82 publications

(60 citation statements)

References 38 publications


“…Through the improvement of technologies, artificial intelligence and CNNs are the most popular sources that have achieved excessive success in many fields, such as handwriting recognition [28], object recognition [23], natural language processing [29, 30], and SER [31]. The convolutional neural networks addressed the scalability issues of the traditional neural networks [32, 33] by allowing them to share similar weights for multiple regions of the inputs [34]. Usually, the CNN model consists of three main building blocks that first include the convolution layers, second the pooling layers, and finally the fully connected layers.…”

Section: Methods (mentioning)

confidence: 99%

Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Tursunov, Mustaqeem, Kwon

2020

Sensors

121


Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine. In addition, making it smarter so that the emotions are efficiently recognized by AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions, such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates are very low, depending on the languages, the emotions, and the databases. In this paper, we propose a new lightweight, effective SER model that has a low computational complexity and a high recognition accuracy. The suggested method uses the convolutional neural network (CNN) approach to learn deep frequency features, which have more discriminative power for SER, by using a plain rectangular filter with a modified pooling strategy. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated over two benchmarks, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Berlin Emotional Speech Database (EMO-DB) datasets, on which it obtained recognition results of 77.01% and 92.02%. The experimental results demonstrated that the proposed CNN-based SER system can achieve a better recognition performance than the state-of-the-art SER systems.
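The three CNN building blocks named in the citing statement for this entry (convolution, pooling, fully connected) can be illustrated with a minimal pure-Python forward pass. This is only a sketch under stated assumptions: the function names, the toy input, the kernel, and the dense weights are all hypothetical, and it is not the Deep-Net architecture described in the abstract. It shows how one small kernel is reused at every input position, which is the weight sharing the statement refers to.

```python
def conv1d(x, kernel):
    """Slide one shared kernel over x (valid cross-correlation, stride 1)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def max_pool1d(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
x = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
kernel = [-1.0, 0.0, 1.0]  # the same three weights are applied at every position

features = max_pool1d(relu(conv1d(x, kernel)), size=2)
scores = dense(features, weights=[[1.0, 1.0], [0.5, -0.5]], bias=[0.0, 1.0])
```

Here the convolution layer has three parameters regardless of input length, whereas a fully connected layer over the same input would need one weight per input sample per output unit.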

“…By taking into account both the amount of training data and the network complexity, it is understandable that the segment duration of 250 ms turned out to be the best choice in our search for the optimal segment duration for the end-to-end systems. The method used in this work for choosing the optimal segment duration has also been adopted in [66] and [67].…”

Section: Pathological Voice Detection Using An End-to-end System (mentioning)

confidence: 99%

Glottal Source Information for Pathological Voice Detection

Narendra, Alku

2020

IEEE Access


Automatic methods for the detection of pathological voice from healthy speech can be considered as potential clinical tools for medical treatment. This study investigates the effectiveness of glottal source information in the detection of pathological voice by comparing the classical pipeline approach to the end-to-end approach. The traditional pipeline approach consists of a feature extractor and a separate classifier. In the former, two sets of glottal features (computed using the quasi-closed phase glottal inverse filtering method) are used together with the widely used openSMILE features. Using both the glottal and openSMILE features extracted from voice utterances and the corresponding healthy/pathology labels, support vector machine (SVM) classifiers are trained. In building end-to-end systems, both raw speech signals and raw glottal flow waveforms are used to train two deep learning architectures: (1) a combination of convolutional neural network (CNN) and multilayer perceptron (MLP), and (2) a combination of CNN and long short-term memory (LSTM) network. Experiments were carried out using three publicly available databases, including dysarthric (the UA-Speech database and the TORGO database) and dysphonic voices (the UPM database). The performance analysis of the detection system based on the traditional pipeline approach showed best results when the glottal features were combined with the baseline openSMILE features. The results of the end-to-end approach indicated higher accuracies (about 2-3 % improvement in all three databases) when glottal flow was used as the raw time-domain input (87.93 % for UA-Speech, 81.12 % for TORGO and 76.66 % for UPM) compared to using raw speech waveform (85.12 % for UA-Speech, 78.83 % for TORGO and 73.71 % for UPM). The evaluation of both approaches demonstrates that automatic detection of pathological voice from healthy speech benefits from using glottal source information.
Index Terms: Pathological voice, glottal source waveform, glottal features, support vector machines, end-to-end systems.
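The citing statement for this entry reports a 250 ms segment duration as the best choice for its end-to-end systems on raw waveforms. A minimal sketch of cutting a waveform into fixed-length segments follows. Everything here is an assumption made for illustration: the function name, the 16 kHz sample rate, the non-overlapping windows, and the choice to drop a short final segment are not taken from the cited papers.

```python
def split_into_segments(samples, sample_rate, segment_ms=250, drop_last=True):
    """Split a 1-D sequence of samples into non-overlapping fixed-length segments.

    sample_rate is in Hz; segment_ms is the segment duration in milliseconds.
    If drop_last is True, a final segment shorter than the others is discarded.
    """
    seg_len = int(sample_rate * segment_ms / 1000)
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    segments = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
    if drop_last and segments and len(segments[-1]) < seg_len:
        segments.pop()
    return segments


# One second of silence at an assumed 16 kHz sample rate.
signal = [0.0] * 16000
segments = split_into_segments(signal, sample_rate=16000)
```

Shorter segments yield more training examples per utterance but give the network less context per example, which is the trade-off between training data and network complexity that the statement alludes to.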


“…The related works in [6][7][8][9] proposed different mechanism to improve the performance of speech emotion recognition in normal environment. Speech emotion recognition system using CNN with the improvement of CapsNets are proposed in [6] by using IEMOCAP dataset and proved that CapsNets get the better performance than baseline CNNs in building the recognition model.…”

Section: Related Work (mentioning)

confidence: 99%

“…Speech emotion recognition system using CNN with the improvement of CapsNets are proposed in [6] by using IEMOCAP dataset and proved that CapsNets get the better performance than baseline CNNs in building the recognition model. The groups of [7] and [8] also used CNN based classifiers that leads to reliable improvements in accuracy of the speed emotion recognition model and two emotion dataset of IEMOCAP and MSP-IMPROV for unbalanced speed with unsupervised learning and for raw speed. The system used the Bag-of-Visual Words as the classification model on Audio Segment Spectrograms is proposed by the groups [9].…”

Section: Related Work (mentioning)

confidence: 99%

Emotion Recognition System of Noisy Speech in Real World Environment

Win, Khine

2020

IJIGSP


Speech is one of the most natural and fundamental means of human-computer interaction, and the state of human emotion is important in various domains. The recognition of human emotion has become essential in real-world applications, but the speech signal is corrupted by various noises from real-world environments, and the recognition performance is reduced by these additional noise signals. Therefore, this paper focuses on developing an emotion recognition system for noisy signals in real-world environments. Minimum mean square error (MMSE) is used as the enhancement technique, Mel-frequency cepstral coefficient (MFCC) features are extracted from the speech signals, and state-of-the-art classifiers are used to recognize the emotional state of the signals. To show the robustness of the proposed system, experiments are carried out using the standard speech emotion database, IEMOCAP, under various SNR levels from 0 dB to 15 dB of real-world background noise. The results are evaluated for seven emotions, and comparisons are prepared and discussed for various classifiers and for various emotions. The results indicate which classifier is best for which emotion in real-world environments, especially in the noisiest conditions, such as at sporting events.
