2019 International Conference on Multimodal Interaction (ICMI 2019)
DOI: 10.1145/3340555.3355713
Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition

Abstract: Audio-video emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features with both speech spectrograms and log-Mel spectrograms and evaluate several facial features with different CNN models and different emotion-pretraining strategies. For fusion strategies, we explore intra-modal and cross…
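
To make the fusion idea concrete, here is a minimal PyTorch sketch of cross-modal fusion by simple feature concatenation. The module names, feature dimensions, and two-layer classifier are illustrative assumptions, not the authors' EmotiW system.

```python
# Minimal sketch of cross-modal (audio + visual) feature fusion, written in
# PyTorch. Dimensions and the concatenation-based fusion head are assumptions.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, num_emotions=7):
        super().__init__()
        # Maps the concatenated audio + visual features to emotion logits.
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, audio_dim), e.g. pooled CNN features of a spectrogram
        # visual_feat: (batch, visual_dim), e.g. pooled CNN features of face frames
        return self.fusion(torch.cat([audio_feat, visual_feat], dim=-1))

# Usage with random stand-in features:
model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 7)
```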

Cited by 57 publications (24 citation statements)
References 33 publications
“…These fusion strategies for combining audio and visual modalities emphasize the most important frames that reveal the subject's emotion. For example, Zhou et al. consider CNNs to extract features from the speech spectrogram and several relevant video frames, which are highlighted through various intra-modal fusion strategies (e.g., self-attention, relation-attention, perceptron-attention) [22].…”
Section: Related Work (mentioning)
confidence: 99%
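
As a rough illustration of the self-attention strategy named in this statement, the sketch below scores each per-frame feature with a learned linear layer and pools frames by their softmax weights. The layer sizes are assumptions for illustration, not taken from [22].

```python
# Hedged sketch of self-attention over per-frame features: each frame gets a
# learned scalar score, and the video-level feature is the softmax-weighted
# sum of the frame features.
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one attention score per frame

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim) CNN features of face frames
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, num_frames, 1)
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)

pooled = FrameSelfAttention()(torch.randn(2, 16, 512))  # shape (2, 512)
```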
“…The quality of features extracted from speech data is crucial for the performance of SER. Many existing studies have shown that the performance of the SER task can be improved by converting the speech signal sequence data into log-Mels spectrogram data (Zhang et al., 2017; Chen et al., 2018; Dai et al., 2019; Zhao et al., 2019; Zhou et al., 2019; Chen and Zhao, 2020; Zayene et al., 2020).…”
Section: Log-Mels Spectrogram Feature Calculation (mentioning)
confidence: 99%
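
A minimal sketch of the log-Mel spectrogram computation these studies build on, using librosa; the sampling rate, FFT window, hop length, and mel-band count below are common speech-processing defaults, not parameters taken from any of the cited papers.

```python
# Compute a log-Mel spectrogram from a waveform file with librosa.
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    y, sr = librosa.load(wav_path, sr=sr)          # load and resample the audio
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)    # (n_mels, num_frames) in dB
```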
“…Then, using all frames, they trained a 3D CNN on the visual data. More recently, Zhou et al. [18] explored emotion features in multimodal video classification systems by using attention mechanisms to highlight important emotional features.…”
Section: Related Research (mentioning)
confidence: 99%
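
For reference, a toy sketch of a 3D CNN over a stack of video frames follows; the architecture is an assumption for illustration (real systems use far deeper networks such as C3D), not the model from [18].

```python
# Toy 3D CNN that convolves jointly over time and space, then pools
# to a clip-level feature for emotion classification.
import torch
import torch.nn as nn

video_net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),  # convolve over (time, H, W)
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),                     # pool to one vector per clip
    nn.Flatten(),
    nn.Linear(16, 7),                            # 7 basic emotion classes
)

clip = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
logits = video_net(clip)                # shape (2, 7)
```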