Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema

Kotti, Margarita; Paternò, Fabio

doi:10.1007/s10772-012-9127-7

Cited by 67 publications

(22 citation statements)

References 49 publications

“…The best results were achieved for the emotions of sadness and joy, the worst result was received for the emotion of anger (see values in Tables 17 and 18). It is not entirely consistent with the results obtained from other authors using the EMO-DB database for GMM emotion recognition [37][38][39] as well as those published in more complex comparison studies [40,41]. Usually, the best recognized emotions are anger and sadness followed by neutral state, the emotion joy generates the most confusion being recognized as anger [39].…”

Section: Discussion Of Results (supporting)

confidence: 60%

Evaluation of influence of spectral and prosodic features on GMM classification of Czech and Slovak emotional speech

Přibil

Přibilová

2013

J AUDIO SPEECH MUSIC PROC.

This article analyzes and compares influence of different types of spectral and prosodic features for Czech and Slovak emotional speech classification based on Gaussian mixture models (GMM). Influence of initial setting of parameters (number of mixture components and used number of iterations) for GMM training process was analyzed, too. Subsequently, analysis was performed to find how correctness of emotion classification depends on the number and the order of the parameters in the input feature vector and on the computation complexity. Another test was carried out to verify the functionality of the proposed two-level architecture comprising the gender recognizer and of the emotional speech classifier. Next tests were realized to find dependence of some negative aspect (processing of the input speech signal with too short time duration, the gender of a speaker incorrectly determined, etc.) on the stability of the results generated during the GMM classification process. Evaluations and tests were realized with the speech material in the form of sentences of male and female speakers expressing four emotional states (joy, sadness, anger, and a neutral state) in Czech and Slovak languages. In addition, a comparative experiment using the speech data corpus in other language (German) was performed. The mean classification error rate of the whole classifier structure achieves about 21% for all four emotions and both genders, and the best obtained error rate was 3.5% for the sadness style of the female gender. These values are acceptable in this first stage of development of the GMM classifier. On the other hand, the test showed the principal importance of correct classification of the speaker gender in the first level, which has heavy influence on the resulting recognition score of the emotion classification. This GMM classifier should be used for evaluation of the synthetic speech quality after applied voice conversion and emotional speech style transformation.

“…A model with human-selected feature extraction (HSF) using the same data split and softmax configuration was trained on several widely used manufactured features including fundamental frequency [19], pitch related features [20], energy related features [21], zero crossing rate (ZCR) [21, 22], jitter [21], shimmer [21], and Mel-frequency cepstral coefficients (MFCC) [22–24]. As suggested in [19, 20], we applied the statistical functions including Maximum, Minimum, Range, Mean, Slope, Offset, Stddev, Skewness, Kurtosis, Variance, and Median for these features.…”

Section: Evaluation Results (mentioning)

confidence: 99%
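The statistical functions named in the citation statement above can be illustrated with a minimal sketch. It is not the citing authors' code: the function name `functionals`, the sample contour, the use of population (rather than sample) moments, and the definition of Slope and Offset as the coefficients of a least-squares line over frame indices are all assumptions made for illustration.

```python
import statistics
from typing import Dict, Sequence


def functionals(contour: Sequence[float]) -> Dict[str, float]:
    """Summarize one frame-level feature contour with eleven statistics."""
    n = len(contour)
    if n < 2:
        raise ValueError("need at least two frames to fit a slope")

    mean = statistics.fmean(contour)
    variance = statistics.pvariance(contour, mu=mean)  # population variance (assumed)
    stddev = variance ** 0.5

    # Slope and Offset: least-squares line over frame indices 0..n-1 (assumed definition).
    t_mean = (n - 1) / 2
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (x - mean) for t, x in enumerate(contour))
    slope = sxy / sxx
    offset = mean - slope * t_mean

    # Skewness and Kurtosis: standardized 3rd and 4th central moments.
    # A flat contour has zero spread, so both are reported as 0.0 here.
    if stddev > 0:
        skewness = sum((x - mean) ** 3 for x in contour) / (n * stddev ** 3)
        kurtosis = sum((x - mean) ** 4 for x in contour) / (n * stddev ** 4)
    else:
        skewness = kurtosis = 0.0

    return {
        "Maximum": max(contour),
        "Minimum": min(contour),
        "Range": max(contour) - min(contour),
        "Mean": mean,
        "Slope": slope,
        "Offset": offset,
        "Stddev": stddev,
        "Skewness": skewness,
        "Kurtosis": kurtosis,
        "Variance": variance,
        "Median": statistics.median(contour),
    }


# Hypothetical pitch contour in Hz, one value per frame.
stats = functionals([100.0, 110.0, 120.0, 130.0, 140.0])
```

Applying the same function to each low-level contour (pitch, energy, ZCR, each MFCC coefficient) and concatenating the results would give one fixed-length vector per utterance, which is the usual reason for using such statistics with variable-length speech.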

“…In fact, even CNN A alone outperformed HSF, further demonstrating the effectiveness of ConvNet-based feature selection. Although one could fine-tune the manually-selected features [21, 22], doing so would be highly laborious compared to automated ConvNet learning.…”

Section: Evaluation Results (mentioning)

confidence: 99%

Speech Intention Classification with Multimodal Deep Learning

Chen et al.

2017

Advances in Artificial Intelligence

We present a novel multimodal deep learning structure that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features were first extracted using two independent convolutional neural network structures, then combined into a joint representation, and finally fed into a decision softmax layer. We tested the proposed model in an actual medical setting, using speech recording and its transcribed log. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model using automatically extracted features for intention classification outperformed existing models that use manufactured features.

“…Possible applications include a callcentre environment, where such an emotion recognition schema can be used to improve the quality of service. Furthermore, by discriminating negative from non-negative emotions, human-computer interaction designers will be able to recognize which parts of the interface are problematic, in the sense that they evoke negative emotions [22]. With respect to the audio, this is extracted from the audio-visual clips as monochannel wav files of a 48kHz sampling rate.…”

Section: Database (mentioning)

confidence: 99%

Effective emotion recognition in movie audio tracks

Kotti

Stylianou

2017

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

This paper addresses the problem of speech emotion recognition from movie audio tracks. The recently collected Acted Facial Expression in the Wild 5.0 database is used. The aim is to discriminate among angry, happy, and neutral. We extract a relatively small number of features, a subset of which is not commonly used for the emotion recognition task. Those features are fed as input to an ensemble classifier that combines random forests with support vector machines. An accuracy of 65.63% is reported, outperforming a baseline system that uses the K-nearest neighbor classifier and has an accuracy of 56.88%. To verify the suitability of the exploited features, the same ensemble classification schema is applied on a feature set similar to those employed in Audio/Visual Emotion Challenge 2011. In the latter case, an accuracy of 61.25% is achieved using a large set of 1582 features, as opposed to just 86 features in our case that lead to a relative improvement of 7.15% in accuracy.
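The 7.15% figure in this abstract is a relative, not an absolute, gain. The short sketch below reproduces the arithmetic from the two accuracies quoted above; the helper name `relative_improvement` is hypothetical and is not taken from the paper.

```python
def relative_improvement(new_accuracy: float, reference_accuracy: float) -> float:
    """Return the gain of new over reference, as a percentage of reference."""
    if reference_accuracy == 0:
        raise ValueError("reference accuracy must be non-zero")
    return (new_accuracy - reference_accuracy) / reference_accuracy * 100


# 86-feature ensemble (65.63%) versus the 1582-feature set (61.25%).
print(round(relative_improvement(65.63, 61.25), 2))  # 7.15
```

The absolute difference between the two systems is 4.38 percentage points; dividing by the 61.25% reference gives the relative value reported in the abstract.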
