Speech-oriented negative emotion recognition

He, Liang; Bo, Yuming; Zhao, Gangming

doi:10.1109/chicc.2015.7260187

Cited by 2 publications (1 citation statement)

References 16 publications (13 reference statements)


“…Gender data was also leveraged in [11], and data augmentation obtained through an adversarial network has been reported as a successful strategy [12]. In [13], a much smaller feature set comprising only four statistical values for the estimated pitch, the first two formants, the energy, and the zero-crossing rate (ZCR) were used together with feed-forward MLP neural networks, but trained with a modified backpropagation algorithm based on genetic algorithm (GA) principles, with a focus on negative emotions. A different approach was taken in [14], operating on the raw time-domain audio signal to extract linear prediction descriptors processed through a Gammatone filterbank before being applied to a spiking neural network (SNN) and liquid state machine (LSM) hybrid model.…”

Section: Introduction; citation type: mentioning (confidence: 99%)

Speech Emotion Recognition Using Deep Neural Networks, Transfer Learning, and Ensemble Classification Techniques

Mihalache; Burileanu (2023), ROMJIST


Speech emotion recognition (SER) is the task of determining the affective content present in speech, a promising research area of great interest in recent years, with important applications especially in the field of forensic speech and law enforcement operations, among others. In this paper, systems based on deep neural networks (DNNs) spanning five levels of complexity are proposed, developed, and tested, including systems leveraging transfer learning (TL) for the top modern image recognition deep learning models, as well as several ensemble classification techniques that lead to significant performance increases. The systems were tested on the most relevant SER datasets: EMODB, CREMAD, and IEMOCAP, in the context of: (i) classification: using the standard full sets of emotion classes, as well as additional negative emotion subsets relevant for forensic speech applications; and (ii) regression: using the continuously valued 2D arousal-valence affect space. The proposed systems achieved state-of-the-art results for the full class subset for EMODB (up to 83% accuracy) and performance comparable to other published research for the full class subsets for CREMAD and IEMOCAP (up to 55% and 62% accuracy). For the class subsets focusing only on negative affective content, the proposed solutions offered top performance vs. previously published state of the art results.
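The citation statement above describes a compact feature set built from four statistical values of the pitch, the first two formants, the energy, and the zero-crossing rate (ZCR). As a rough illustration only, the Python sketch below computes such statistics for two of those contours, short-time energy and ZCR. The choice of mean, standard deviation, minimum, and maximum as the four statistics, the frame length and hop size, and all function names are assumptions made for this example; the cited work's actual choices are not given here. Pitch and formant estimation are omitted.

```python
import numpy as np


def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames, one frame per row.

    frame_len and hop are assumed values (25 ms / 10 ms at 16 kHz).
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        # Zero-pad short signals so that at least one full frame exists.
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def four_stats(contour):
    """Assumed statistics: mean, standard deviation, minimum, maximum."""
    return np.array([contour.mean(), contour.std(), contour.min(), contour.max()])


def energy_zcr_features(x, frame_len=400, hop=160):
    """Concatenate four statistics of the energy contour and of the ZCR contour."""
    frames = frame_signal(x, frame_len, hop)
    return np.concatenate([
        four_stats(short_time_energy(frames)),
        four_stats(zero_crossing_rate(frames)),
    ])
```

Calling `energy_zcr_features(signal)` on a mono waveform returns an 8-element vector. A full version of the feature set described above would append the same four statistics for the pitch and the first two formant contours before passing the vector to a classifier.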
