2021
DOI: 10.1109/access.2021.3116530
An End-To-End Emotion Recognition Framework Based on Temporal Aggregation of Multimodal Information

Abstract: Humans express and perceive emotions in a multimodal manner. The multimodal information is intrinsically fused by the human sensory system in a complex way. Emulating a temporal desynchronisation between modalities, in this paper we design an end-to-end neural network architecture, called TA-AVN, that aggregates temporal audio and video information in an asynchronous setting to determine the emotional state of a subject. The feature descriptors for audio and video representations are extracted usi…
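As a rough illustration of the asynchronous aggregation the abstract describes, the following is a minimal PyTorch sketch: each modality is encoded and summarised over time independently, so the audio and video sequences never need frame-level alignment before fusion. All names (AsyncAVAggregator), dimensions, and layer choices here are assumptions made for illustration, not the published TA-AVN design.

```python
# Minimal sketch (NOT the authors' exact TA-AVN): each modality gets its
# own temporal encoder, so the two streams may differ in length and rate.
import torch
import torch.nn as nn

class AsyncAVAggregator(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, hidden=128, n_emotions=7):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, audio_seq, video_seq):
        # audio_seq: (B, T_a, audio_dim); video_seq: (B, T_v, video_dim).
        # T_a != T_v is fine: no synchronisation step is required.
        _, h_audio = self.audio_rnn(audio_seq)  # final hidden state summarises the stream
        _, h_video = self.video_rnn(video_seq)
        fused = torch.cat([h_audio[-1], h_video[-1]], dim=-1)
        return self.classifier(fused)

# 120 audio frames vs. 30 video frames, fused without alignment.
model = AsyncAVAggregator()
logits = model(torch.randn(2, 120, 40), torch.randn(2, 30, 512))  # (2, 7)
```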

Cited by 17 publications (14 citation statements) | References 34 publications
“…That is different from late fusion of modalities [27], [28] or from building temporal features to extract global information by assuming that emotions are expressed simultaneously [26]. Late fusion is commonly applied by concatenating the learned features of all modalities in [27], [28] or with a pairwise scheme in [26]. Instead, the authors of M3ER [29] propose a data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others by integrating Canonical Correlation Analysis as a pre-processing step.…”
Section: Related Work
confidence: 85%
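To make the contrast in the statement above concrete, here is a hedged sketch of the two fusion styles: late fusion by concatenating per-modality features [27], [28], and a simplified multiplicative combination in the spirit of M3ER [29]. Class names, dimensions, and the three-modality setup are illustrative assumptions; the multiplicative module captures only the intuition of down-weighting unreliable cues and omits M3ER's CCA pre-processing.

```python
# Illustrative sketch of the two fusion styles; names and dimensions are
# assumptions, not taken from the cited papers.
import torch
import torch.nn as nn

class LateConcatFusion(nn.Module):
    """Late fusion: concatenate per-modality features, then classify."""
    def __init__(self, dims=(128, 128, 128), n_emotions=7):
        super().__init__()
        self.head = nn.Linear(sum(dims), n_emotions)

    def forward(self, feats):  # feats: list of (B, d_i) tensors, one per modality
        return self.head(torch.cat(feats, dim=-1))

class MultiplicativeFusion(nn.Module):
    """Simplified multiplicative combination: per-modality predictions are
    multiplied elementwise, so a modality that spreads its probability mass
    thinly (i.e., is uncertain) contributes less to the product."""
    def __init__(self, dim=128, n_modalities=3, n_emotions=7):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(dim, n_emotions) for _ in range(n_modalities))

    def forward(self, feats):
        probs = [torch.softmax(head(f), dim=-1)
                 for head, f in zip(self.heads, feats)]
        fused = probs[0]
        for p in probs[1:]:
            fused = fused * p
        return fused  # unnormalised scores; renormalise or argmax downstream

feats = [torch.randn(4, 128) for _ in range(3)]  # 4 samples, 3 modalities
print(LateConcatFusion()(feats).shape)           # torch.Size([4, 7])
print(MultiplicativeFusion()(feats).shape)       # torch.Size([4, 7])
```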
“…Early works adapt classifiers such as SVMs and linear and logistic regression [39], [40], while, as larger datasets were developed, deep learning architectures were also explored. For example, [27] is based on CNNs, and [26], [28] use RNNs. Some recent studies [41], [14], [16] adopt Transformers.…”
Section: Related Work
confidence: 99%
“…Classifiers are usually built on well-known artificial intelligence tools and algorithms, including decision trees, neural networks, Bayesian networks, linear discriminant analysis, linear and logistic regression, Support Vector Machines, Hidden Markov Models [28], and lately also convolutional networks [26]. Deep learning has become a new trend in emotion recognition, especially as applied to facial expression recognition [30,14,29]. Depending on the classification method, input channels and selected features, the accuracy of affect recognition differs significantly, sometimes exceeding 90 percent, but mostly in a cross-validation scheme on a single dataset.…”
Section: Related Work
confidence: 99%