Effective Attention Mechanism in Dynamic Models for Speech Emotion Recognition

Automatic speech emotion recognition (SER) remains a difficult task within human-computer interaction, despite increasing interest in the research community. One key challenge is how to effectively integrate short-term characterisation of speech segments with long-term information such as temporal variations. Motivated by the numerical approximation theory of stochastic differential equations (SDEs), we propose the novel use of path signatures, which provide a pathwise definition for solving SDEs, to integrate short speech frames. Furthermore, we propose a hierarchical tree structure of path signatures to capture both global and local information. A simple tree-based convolutional neural network (TBCNN) is used for learning the structural information stemming from dyadic path-tree signatures. Our experimental results on a widely used benchmark dataset demonstrate comparable performance to complex neural-network-based systems.

Index Terms: speech emotion recognition, path signature feature, convolutional neural network

Footnote 1: Following Rough Path theory notation, a path refers to a continuous function mapping from a compact time interval J := [S, T] to E := R^d.

Section: Speech Emotion Recognition (mentioning)

confidence: 99%

A Path Signature Approach for Speech Emotion Recognition

Wang, Liakata et al. 2019
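The path-signature idea in the abstract above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a piecewise-linear path sampled as an array of frame-level feature vectors, truncates the signature at depth 2, and builds a dyadic tree by halving the frame range at each level. The function names `signature_depth2` and `dyadic_signatures` are hypothetical, and the TBCNN that consumes the tree is not reproduced.

```python
import numpy as np


def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as an (n_points, d) array."""
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)              # one row per linear segment
    level1 = increments.sum(axis=0)                 # total displacement
    # Displacement accumulated before each segment starts.
    before = np.cumsum(increments, axis=0) - increments
    # Iterated integrals: cross terms between earlier and later segments,
    # plus the within-segment term 0.5 * delta (x) delta (Chen's identity).
    level2 = before.T @ increments + 0.5 * increments.T @ increments
    return level1, level2


def dyadic_signatures(path, depth):
    """Signature features per tree node; level k splits the path into 2**k pieces.

    Assumes the path has at least 2**depth + 1 points so every piece is non-trivial.
    """
    path = np.asarray(path, dtype=float)
    n = len(path)
    tree = []
    for level in range(depth + 1):
        edges = np.linspace(0, n - 1, 2 ** level + 1).round().astype(int)
        nodes = []
        for a, b in zip(edges[:-1], edges[1:]):
            l1, l2 = signature_depth2(path[a:b + 1])  # neighbours share an endpoint
            nodes.append(np.concatenate([l1, l2.ravel()]))
        tree.append(np.stack(nodes))
    return tree
```

For a path with `d` channels each node carries `d + d*d` numbers, so calling `dyadic_signatures(frames, 2)` on a `(9, 3)` array yields three levels with 1, 2, and 4 nodes of 12 features each. The root node summarises the whole utterance, while deeper levels describe progressively shorter stretches of it.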

“…Trigeorgis et al. and Panagiotis et al. [6,16] propose an end-to-end CNN-LSTM model to capture temporal dynamics in a single utterance for emotion prediction. Several recent studies [17,5,15,18,19] explored the attention mechanism to focus on the emotion-salient frames in an utterance. However, these methods perform speech emotion recognition on a single speech segment without considering the context information in the dialogue.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most previous studies perform speech emotion recognition on a single speech segment. Among them, the CNN-LSTM network has achieved state-of-the-art performance in predicting the emotion of a single utterance [4,5,6]. However, emotion is not an instantaneous state.…”

Section: Introduction (mentioning)

confidence: 99%

Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling

Zhao, Chen, Liang et al. 2019

In dyadic human-human interactions, a more complex interaction scenario, a person's emotional state can be influenced by both their own emotional evolution and the interlocutor's behaviors. However, previous speech emotion recognition studies infer the speaker's emotional state mainly from the targeted speech segment, without considering these two contextual factors. In this paper, we propose an Attentive Interaction Model (AIM) to capture both self- and interlocutor-context to enhance speech emotion recognition in dyadic dialog. The model learns to dynamically focus on long-term relevant contexts of the speaker and the interlocutor via the self-attention mechanism, and fuses the adaptive context with the present behavior to predict the current emotional state. We carry out extensive experiments on the IEMOCAP corpus for dimensional emotion recognition in arousal and valence. Our model achieves on-par performance with baselines for arousal recognition and significantly outperforms baselines for valence recognition, which demonstrates the effectiveness of the model in selecting useful contexts for emotion recognition in dyadic interactions.
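The fusion step described in this abstract can be illustrated with plain scaled dot-product attention. This is a rough sketch under stated assumptions, not the AIM architecture itself: the current utterance embedding serves directly as the query, there are no learned projection matrices, and fusion is simple concatenation. The names `attend` and `fuse_contexts` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, context):
    """Scaled dot-product attention of one query (d,) over context rows (n, d)."""
    scores = context @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)               # one weight per context utterance
    return weights @ context, weights       # weighted summary and the weights


def fuse_contexts(current, self_ctx, other_ctx):
    """Concatenate the current utterance with attended self- and interlocutor-context.

    current: (d,) embedding of the utterance being classified (assumed given).
    self_ctx, other_ctx: (n, d) embeddings of earlier utterances by the same
    speaker and by the interlocutor, respectively.
    """
    self_summary, self_w = attend(current, self_ctx)
    other_summary, other_w = attend(current, other_ctx)
    fused = np.concatenate([current, self_summary, other_summary])
    return fused, self_w, other_w
```

In a trained model the fused vector would feed a regressor for arousal and valence; here the returned weights simply show which earlier utterances each context summary draws on.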

“…More recently, end-to-end training has dominated in SER due to its joint optimization of feature extractor and classifier. Additionally, intrinsic structures [4] [5] and efficient mechanisms such as attention [6] [7] aim to refine emotional information in the speech signal and produce more discriminative representations. From the perspective of the loss function, however, fewer works have been reported in SER, even though loss-based approaches have produced successive state-of-the-art results in other domains [8] [9] [10].…”

Section: Introduction (mentioning)

confidence: 99%

Towards Discriminative Representations and Unbiased Predictions: Class-Specific Angular Softmax for Speech Emotion Recognition

Li, He, Li et al. 2019

Speech emotion recognition (SER) is a challenging task: complex emotional expressions make it difficult to discriminate between emotions, and unbalanced data misleads models into giving biased predictions. In this work, we tackle these two problems with the angular softmax loss. First, we replace the vanilla softmax with angular softmax to learn emotional representations with strong discriminative power. In addition, inspired by its novel geometric interpretation, we establish a general calculation model and deduce a concise formula for the decision domain. Based on these derivations, we propose our solution to data imbalance: class-specific angular softmax, by which we can directly adjust the decision domains of different emotion classes. Experimental results on the IEMOCAP corpus indicate significant improvements on two state-of-the-art models and therefore demonstrate the effectiveness of our proposed methods.
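To make the angular-margin idea concrete, the sketch below implements the widely used A-Softmax formulation, in which the target-class logit cos(θ) is replaced by the monotone function ψ(θ) = (-1)^k cos(mθ) - 2k for θ in [kπ/m, (k+1)π/m]. Passing a separate integer margin per class is only an assumption about what a class-specific variant might look like; the abstract does not give the paper's actual formula, and the name `angular_softmax_loss` is hypothetical.

```python
import numpy as np


def angular_softmax_loss(features, weights, labels, margins):
    """Mean A-Softmax-style loss with one integer margin per class.

    features: (n, d) non-zero embeddings; weights: (d, c) class vectors;
    labels: (n,) integer class ids; margins: (c,) positive integers (assumed).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    margins = np.asarray(margins)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)  # unit class vectors
    norms = np.linalg.norm(features, axis=1)
    cos = np.clip(features @ w / norms[:, None], -1.0, 1.0)
    logits = norms[:, None] * cos                   # non-target logits: ||x|| cos(theta)
    rows = np.arange(len(labels))
    theta = np.arccos(cos[rows, labels])            # angle to the true class vector
    m = margins[labels]
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)  # index of the monotone piece
    psi = (-1.0) ** k * np.cos(m * theta) - 2.0 * k
    logits[rows, labels] = norms * psi              # margin applied to the target only
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

With every margin set to 1 this reduces to ordinary cross-entropy over weight-normalised logits. Raising the margin for a class lowers its target logit, so a sample must lie closer in angle to that class vector to reach the same loss, which is one way a per-class setting could shift decision regions.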