Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition

Kim, Jaebok; Englebienne, Gwenn; Truong, Khiet P.; Evers, Vanessa

doi:10.1145/3123266.3123353

Cited by 32 publications

(17 citation statements)

References 32 publications

“…The network layers extract abstract representations and also filter out the irrelevant information which leads to a more accurate classification [6,7] and better generalisation [8,9]. Temporal models were also proposed for modelling sequential data with mid to long-term dependencies [10,11].…”

Section: Introduction (mentioning)

confidence: 99%

Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition

Jalal, Loweimi, Moore et al. 2019

Interspeech 2019

Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from a machine learning standpoint is to extract patterns which bear maximum correlation with the emotion information encoded in this signal while being as insensitive as possible to other types of information carried by speech. In this paper, we propose a novel temporal modelling framework for robust emotion classification using bidirectional long short-term memory network (BLSTM), CNN and Capsule networks. The BLSTM deals with the temporal dynamics of the speech signal by effectively representing forward/backward contextual information while the CNN along with the dynamic routing of the Capsule net learn temporal clusters which altogether provide a state-of-the-art technique for classifying the extracted patterns. The proposed approach was compared with a wide range of architectures on the FAU-Aibo and RAVDESS corpora and remarkable gains over state-of-the-art systems were obtained. For FAU-Aibo and RAVDESS, 77.6% and 56.2% accuracy was achieved, respectively, which is 3% and 14% (absolute) higher than the best-reported result for the respective tasks.

“…A CNN-LSTM model taking spectrograms as input was also proposed in [1]. CNNs with an attention mechanism were investigated in [21] and [2] proposed an architecture composed of convolutional highway networks and LSTMs. SER models trained on a single corpus tend to overfit, leading to poor performance on out-of-domain data, as presented in [3].…”

Section: Related Work (mentioning)

confidence: 99%

“…Deep learning architectures, such as Convolutional Neural Networks (CNNs) [1] and highway networks [2], have been shown to yield state-of-the-art performance on this task. However, being able to use these models "in the wild" (i.e.…”

Section: Introduction (mentioning)

confidence: 99%

Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition

Parry, Palaz, Lecomte et al. 2019

Interspeech 2019

Speech Emotion Recognition (SER) is an important and challenging task for human-computer interaction. In the literature deep learning architectures have been shown to yield state-of-the-art performance on this task when the model is trained and evaluated on the same corpus. However, prior work has indicated that such systems often yield poor performance on unseen data. To improve the generalisation capabilities of emotion recognition systems one possible approach is cross-corpus training, which consists of training the model on an aggregation of different corpora. In this paper we present an analysis of the generalisation capability of deep learning models using cross-corpus training with six different speech emotion corpora. We evaluate the models on an unseen corpus and analyse the learned representations using the t-SNE algorithm, showing that architectures based on recurrent neural networks are prone to overfit the corpora present in the training set, while architectures based on convolutional neural networks (CNNs) show better generalisation capabilities. These findings indicate that (1) cross-corpus training is a promising approach for improving generalisation and (2) CNNs should be the architecture of choice for this approach.
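
The cross-corpus setup described in this abstract (train on an aggregation of corpora, evaluate on an unseen one) can be illustrated with a minimal data-split sketch. This is not the authors' code: the function name `leave_one_corpus_out`, the `(utterance_id, emotion_label)` sample format, and the corpus names in the usage note are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical sample format: (utterance_id, emotion_label).
Sample = Tuple[str, str]


def leave_one_corpus_out(
    corpora: Dict[str, List[Sample]], held_out: str
) -> Tuple[List[Sample], List[Sample]]:
    """Pool every corpus except `held_out` for training; keep `held_out` for testing."""
    if held_out not in corpora:
        raise KeyError(f"unknown corpus: {held_out}")
    # Aggregate the samples of all remaining corpora into one training set.
    train = [
        sample
        for name, samples in corpora.items()
        if name != held_out
        for sample in samples
    ]
    # The held-out corpus is never seen during training.
    test = list(corpora[held_out])
    return train, test
```

For example, with three placeholder corpora `"A"`, `"B"` and `"C"`, calling `leave_one_corpus_out(corpora, "B")` returns the pooled samples of `"A"` and `"C"` for training and the samples of `"B"` for evaluation.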

“…A multimodal system was built with a late fusion approach. On one side, the spectrograms of the audio cue were used to train a system similar to the one described in (Kim et al., 2017b). As shown in Figure 2, a convolutional neural network (CNN) was used to extract descriptors from the data.…”

Section: Multimodal Emotion Recognition (mentioning)

confidence: 99%

“…A first step towards a fully functional listening agent is to first build a recognition module that would feed a prediction system such as the one mentioned above. For this we utilize both audio and visual signals for emotion recognition using machine learning techniques, such as temporal (Kim & Provost, 2016) or deep learning models (Kim, Provost, & Lee, 2013), (Kim, 2017a, 2017b).…”

Section: Introduction (mentioning)

confidence: 99%