One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). To improve the generalisation capabilities of emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". Compared to state-of-the-art Single-Task Learning (STL) methods, our proposed MTL method significantly improved performance. In particular, models using both gender and naturalness achieved larger gains than those using either auxiliary task alone. This benefit was also visible in the high-level representations of the feature space obtained with our proposed method, where discriminative emotional clusters could be observed.
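The MTL idea described above can be sketched as a shared trunk feeding one classification head per task, with auxiliary losses added to the emotion loss. The following is a minimal numpy sketch, not the authors' implementation; all dimensions, the tanh trunk, and the task weights (1.0 for emotion, 0.3 for each auxiliary task) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40-dim acoustic features, 64-dim shared layer.
D_IN, D_HID = 40, 64
N_EMO, N_GENDER, N_NAT = 4, 2, 2  # emotion, gender, naturalness classes

# Shared trunk weights plus one linear head per task.
W_shared = rng.normal(scale=0.1, size=(D_IN, D_HID))
heads = {
    "emotion":     rng.normal(scale=0.1, size=(D_HID, N_EMO)),
    "gender":      rng.normal(scale=0.1, size=(D_HID, N_GENDER)),
    "naturalness": rng.normal(scale=0.1, size=(D_HID, N_NAT)),
}

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    h = np.tanh(x @ W_shared)  # representation shared across all three tasks
    return {task: softmax(h @ W) for task, W in heads.items()}

def mtl_loss(probs, labels, weights):
    # Weighted sum of per-task cross-entropies; the auxiliary tasks
    # (gender, naturalness) regularise the shared representation.
    return sum(
        w * -np.log(probs[t][np.arange(len(labels[t])), labels[t]]).mean()
        for t, w in weights.items()
    )

x = rng.normal(size=(8, D_IN))  # batch of 8 utterance-level feature vectors
labels = {
    "emotion":     rng.integers(0, N_EMO, 8),
    "gender":      rng.integers(0, N_GENDER, 8),
    "naturalness": rng.integers(0, N_NAT, 8),
}
probs = forward(x)
loss = mtl_loss(probs, labels,
                weights={"emotion": 1.0, "gender": 0.3, "naturalness": 0.3})
```

At test time only the emotion head is used; the auxiliary heads exist purely to shape the shared representation during training.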
In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that provide a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels in a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning than the other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and observed distinct clusters of emotions.
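The key point above is that a single 3D kernel spans temporal context and the spectral axis jointly. Below is a minimal, illustrative numpy sketch of one "valid" 3D convolution with a shallow temporal kernel and a deeper spectral kernel; the input shape, kernel sizes, and segment layout are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: 8 stacked short spectrogram segments,
# each 10 frames x 40 mel bands -> (segments, time, frequency).
x = rng.normal(size=(8, 10, 40))

# Shallow temporal extent (2 segments, 3 frames) but a deeper
# spectral extent (5 bands), echoing finding 1) above.
k = rng.normal(size=(2, 3, 5))

def conv3d_valid(x, k):
    """Naive 'valid' 3D convolution (cross-correlation), for illustration only."""
    d, h, w = np.array(x.shape) - np.array(k.shape) + 1
    out = np.zeros((d, h, w))
    kd, kh, kw = k.shape
    for i in range(d):
        for j in range(h):
            for l in range(w):
                out[i, j, l] = np.sum(x[i:i+kd, j:j+kh, l:l+kw] * k)
    return out

y = conv3d_valid(x, k)
print(y.shape)  # (7, 8, 36): one feature map over segments, time and frequency jointly
```

A CNN-LSTM, by contrast, would convolve each segment independently in 2D and leave long-term modelling to the recurrent layer; the 3D kernel folds both into one operation.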
Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth alone does not explain their success. Therefore, we examine the potential of identity skip-connections for the task of Speech Emotion Recognition (SER), where moderately deep temporal architectures are often employed. To this end, we propose a novel architecture which regulates unimpeded feature flows and captures long-term dependencies via gate-based skip-connections and a memory mechanism. Our proposed architecture is compared to other state-of-the-art methods for SER and is evaluated on large aggregated corpora recorded in different contexts. It outperforms the state-of-the-art methods by 9-15% and achieves an Unweighted Accuracy of 80.5% under an imbalanced class distribution. In addition, we examine a variant adopting the simplified skip-connections of Residual Networks (ResNet) and show that gate-based skip-connections are more effective than simplified ones.
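The contrast drawn above between gate-based and simplified skip-connections can be illustrated with a highway-style gated block: a learned gate interpolates between the transformed features and the identity path, whereas a ResNet block adds them unconditionally. This is a generic sketch of the gating idea, not the paper's architecture; the dimensions, tanh transform, and gate bias are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical feature dimension

W_h = rng.normal(scale=0.1, size=(D, D))  # transform weights H(x)
W_g = rng.normal(scale=0.1, size=(D, D))  # gate weights
b_g = np.full(D, -1.0)                    # negative bias initially favours the skip path

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip_block(x):
    h = np.tanh(x @ W_h)           # candidate transformation H(x)
    g = sigmoid(x @ W_g + b_g)     # per-feature gate in (0, 1)
    return g * h + (1.0 - g) * x   # gated mix; g -> 0 recovers the identity skip

def resnet_block(x):
    return x + np.tanh(x @ W_h)    # simplified ResNet-style skip: ungated addition

x = rng.normal(size=(4, D))
y_gated = gated_skip_block(x)
y_res = resnet_block(x)
```

The gate lets the network regulate how much of each feature flows through unimpeded, per dimension and per input, which an ungated residual addition cannot do.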