2019 International Conference on Multimodal Interaction (ICMI 2019)
DOI: 10.1145/3340555.3355720

Multi-Attention Fusion Network for Video-based Emotion Recognition

Cited by 25 publications (13 citation statements)
References 17 publications
“…As for "Surprise" and "Disgust", the weaker performance might be due to a mixing of different emotions, which makes these categories difficult to classify correctly. We also observe that the proportion of these two emotions is the lowest in the training set, and similar results are found in [22], [29], [30], [77]. Finally, the proposed methods are further evaluated on the IEMOCAP database.…”
Section: E. Overall Comparison (supporting)
confidence: 72%
“…To improve emotion recognition performance, the mouth area was further divided into several subregions, as elaborated in [53], extracting LBP-TOP features from each subregion and concatenating the resulting features. In [30], a multiple attention fusion network (MAFN) was proposed by modeling human emotion recognition mechanisms.…”
Section: Audio-Visual Based Emotion Recognition (mentioning)
confidence: 99%
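To make the subregion idea in the excerpt above concrete, the sketch below splits a grayscale mouth crop into a grid of cells, computes a local binary pattern (LBP) histogram per cell, and concatenates them. It uses plain per-frame LBP from scikit-image rather than the full spatio-temporal LBP-TOP described in [53]; the grid size, LBP parameters, and function name `subregion_lbp_features` are illustrative assumptions, not the cited paper's exact setup.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def subregion_lbp_features(mouth, grid=(2, 4), P=8, R=1):
    """Concatenate uniform-LBP histograms over a grid of subregions."""
    lbp = local_binary_pattern(mouth, P, R, method="uniform")
    n_bins = P + 2  # uniform codes 0..P, plus one bin for non-uniform patterns
    feats = []
    for band in np.array_split(lbp, grid[0], axis=0):       # split into row bands
        for cell in np.array_split(band, grid[1], axis=1):  # split into columns
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)  # length: grid[0] * grid[1] * n_bins

# Example: a 2x4 grid on a 48x96 mouth crop gives an 80-dimensional vector.
crop = (np.random.rand(48, 96) * 255).astype(np.uint8)
print(subregion_lbp_features(crop).shape)  # (80,)
```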
“…Secondly, the relationship features of different layers are fully exploited by a bidirectional RNN with self-attention. Wang et al. [21] defined a multimodal domain adaptation method to capture the interaction between modalities. The performance of emotion recognition is evaluated using different CNN architectures and different CNN feature layers in [11].…”
Section: Recognizing Emotion From Videos (mentioning)
confidence: 99%
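As a rough illustration of the bidirectional-RNN-with-self-attention pattern mentioned in the last excerpt, here is a minimal PyTorch sketch: a BiGRU encodes per-frame features, single-head self-attention re-weights the time steps, and mean pooling feeds a linear classifier. The layer sizes, the GRU choice, and the class name `BiRNNSelfAttention` are assumptions for illustration, not the cited papers' exact architectures.

```python
import torch
import torch.nn as nn

class BiRNNSelfAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden=128, n_classes=7):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1, batch_first=True)
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)               # BiGRU states: (batch, frames, 2*hidden)
        a, _ = self.attn(h, h, h)        # self-attention across time steps
        return self.cls(a.mean(dim=1))   # mean-pool over frames, then classify

logits = BiRNNSelfAttention()(torch.randn(2, 16, 512))  # -> shape (2, 7)
```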