Interspeech 2019
DOI: 10.21437/Interspeech.2019-2044

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile

Abstract: A growing number of human-centered applications benefit from continuous advancements in emotion recognition technology. Many emotion recognition algorithms have been designed to model multimodal behavioral cues to achieve high performance. However, most of them do not consider the modulating factors of an individual's personal attributes in his/her expressive behaviors. In this work, we propose a Personalized Attributes-Aware Attention Network (PAaAN) with a novel personalized attention mechanism to perform…
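The truncated abstract points to attention that is conditioned on a speaker's personal attributes. As a rough, hypothetical illustration of that idea (not the paper's actual PAaAN architecture; every name, dimension, and projection below is an assumption for exposition), the following numpy sketch derives an attention query from an attribute profile and uses it to pool frame-level multimodal features:

```python
# Minimal sketch of attribute-conditioned attention pooling.
# All shapes, weights, and the profile-to-query projection are assumptions,
# not the published PAaAN design.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def personalized_attention(frames, profile, W_q, W_k):
    """Pool frame-level features with attention conditioned on a profile.

    frames : (T, d) frame-level multimodal features
    profile: (p,)   personal-attribute embedding (hypothetical)
    """
    query = profile @ W_q                  # (h,) query derived from the profile
    keys = frames @ W_k                    # (T, h) one key per frame
    scores = keys @ query / np.sqrt(keys.shape[1])
    alpha = softmax(scores)                # (T,) attention weight per frame
    return alpha @ frames                  # (d,) utterance-level representation

T, d, p, h = 50, 64, 8, 32
frames = rng.standard_normal((T, d))
profile = rng.standard_normal(p)           # e.g., encoded speaker attributes
W_q = rng.standard_normal((p, h)) * 0.1
W_k = rng.standard_normal((d, h)) * 0.1
utterance_vec = personalized_attention(frames, profile, W_q, W_k)
print(utterance_vec.shape)                 # (64,)
```

The point of the sketch is only that two speakers with different attribute profiles produce different attention weights over the same behavioral frames, which is the modulation the abstract describes.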

Cited by 22 publications (14 citation statements)
References 22 publications (26 reference statements)
“…Recent studies employ multi-task learning to construct gender-dependent models without inputting speaker attributes [18, 27]. Personal profiles have also been utilized for speaker-dependent emotion recognition [28]. In this paper, we do not employ speaker adaptation, so as to isolate the influence of listener dependency alone; it would, however, be possible to combine the proposed LD model with existing speaker adaptation methods.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, more research effort has focused on auxiliary information and innovative ways to assist emotion recognition. For example, transcripts, language cues, and cross-cultural information have been adopted in emotion recognition [25], [36], [37]. In [38], conditioned data augmentation using generative adversarial networks (GANs) was explored to address the problem of data imbalance in SER tasks.…”
Section: Related Work, A. Audio-Based Emotion Recognition (mentioning)
confidence: 99%
“…[24] proposed to bridge the emotional gap using a hybrid deep model, which first produces audio-visual segment features with convolutional neural networks (CNNs) and a 3D-CNN, then fuses them in deep belief networks (DBNs). In [25], the different modalities were concatenated after an encoder, which yielded significant improvements. In our recent work [26], we introduced global-trunk-based factorized bilinear pooling (G-FBP) to integrate the audio and visual features, achieving state-of-the-art performance.…”
Section: Introduction (mentioning)
confidence: 99%
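The G-FBP method cited above builds on factorized bilinear pooling. As a generic sketch of plain factorized bilinear pooling for audio-visual fusion (the global-trunk extension of [26] is not reproduced; the dimensions, initialization, and function names below are assumptions):

```python
# Sketch of factorized bilinear pooling: a rank-constrained bilinear
# interaction between two modality vectors, followed by the customary
# power and l2 normalization. Shapes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def factorized_bilinear_pool(x, y, U, V, k):
    """Fuse an audio vector x (dx,) and a visual vector y (dy,).

    U: (dx, d*k) and V: (dy, d*k) are the low-rank factor matrices;
    the fused output has dimension d.
    """
    joint = (x @ U) * (y @ V)                 # (d*k,) elementwise interaction
    z = joint.reshape(-1, k).sum(axis=1)      # sum-pool over the k factors
    z = np.sign(z) * np.sqrt(np.abs(z))       # signed square-root (power norm)
    return z / (np.linalg.norm(z) + 1e-8)     # l2 normalization

dx, dy, d, k = 128, 256, 64, 4
x = rng.standard_normal(dx)                   # audio embedding
y = rng.standard_normal(dy)                   # visual embedding
U = rng.standard_normal((dx, d * k)) * 0.05
V = rng.standard_normal((dy, d * k)) * 0.05
fused = factorized_bilinear_pool(x, y, U, V, k)
print(fused.shape)                            # (64,)
```

Compared with plain concatenation as in [25], the bilinear form lets every audio dimension interact multiplicatively with every visual dimension while the factorization keeps the parameter count manageable.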
“…Features obtained from each model were fused using a DNN to classify the emotion. Li et al. [8] proposed a personalized attribute-aware attention mechanism in which an attention profile is learned based on acoustic and lexical behavior data. Mirsamadi et al. [15] used deep learning along with local attention to automatically extract relevant features, where segment-level acoustic features are aggregated into an utterance-level emotion representation.…”
Section: Introduction (mentioning)
confidence: 99%
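As a generic illustration of the segment-to-utterance aggregation attributed to Mirsamadi et al. [15] (a minimal sketch with assumed dimensions and randomly initialized weights, not the paper's exact formulation):

```python
# Sketch of soft attention pooling: segment-level acoustic features are
# scored by a learned context vector and combined into one utterance-level
# emotion representation. All parameters here are placeholders.
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(segments, W, u):
    """segments: (T, d) segment-level acoustic features."""
    hidden = np.tanh(segments @ W)    # (T, h) nonlinear projection per segment
    alpha = softmax(hidden @ u)       # (T,) relevance weight per segment
    return alpha @ segments           # (d,) weighted utterance representation

T, d, h = 120, 40, 16
segments = rng.standard_normal((T, d))
W = rng.standard_normal((d, h)) * 0.1
u = rng.standard_normal(h)            # learned context (here random)
utt = attention_pool(segments, W, u)
print(utt.shape)                      # (40,)
```

Unlike mean pooling, the attention weights let emotionally salient segments dominate the utterance-level representation.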