Proceedings of the Third International Conference on Computer Vision Theory and Applications 2008
DOI: 10.5220/0001082801450151

Low-Level Fusion of Audio and Video Feature for Multi-Modal Emotion Recognition

Abstract: Bimodal emotion recognition through audiovisual feature fusion has been shown to be superior to each individual modality in the past. Still, synchronization of the two streams is a challenge, as many vision approaches work on a frame basis, as opposed to the turn- or chunk-basis of audio. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We ther…

Cited by 13 publications (3 citation statements)
References 14 publications
“…In general, DNN-based multimodal models include multiple streams of networks for modalities [14]. The models often have a component for fusing the features [2], [6], [15] to make a prediction based on intermediate features from the streams. Hence, by considering the fusion stage in the model architecture, multimodal architectures can be categorized into early fusion [15], mid-fusion [6], [7], [14], [16], and late fusion [2], [17].…”
Section: Related Work, A. Multimodal Models
confidence: 99%
“…The early fusion approach integrates the features of different modalities as inputs and uses a unified feature for the downstream task [15]. The mid-fusion approach involves the concatenation of features that are encoded from the raw data of each modality into a single feature [6], [7], [14], [16].…”
Section: Related Work, A. Multimodal Models
confidence: 99%
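The early/mid/late fusion taxonomy described in these citation statements can be sketched with toy feature vectors. The following is a minimal NumPy illustration, not any of the cited models: the feature dimensions and the random linear maps standing in for learned networks are assumptions chosen only to make the fusion stages visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (e.g. audio and video).
audio = rng.normal(size=8)
video = rng.normal(size=12)

def linear(x, out_dim, seed):
    """Stand-in for a learned network: a single random linear map."""
    w = np.random.default_rng(seed).normal(size=(out_dim, x.shape[0]))
    return w @ x

# Early fusion: concatenate low-level features, then one shared model.
early_pred = linear(np.concatenate([audio, video]), 4, seed=1)

# Mid-fusion: encode each modality first, then concatenate the encodings
# into a single feature for the downstream predictor.
audio_enc = linear(audio, 6, seed=2)
video_enc = linear(video, 6, seed=3)
mid_pred = linear(np.concatenate([audio_enc, video_enc]), 4, seed=4)

# Late fusion: independent per-modality predictions combined at the
# decision level (here, a simple average; voting is another option).
late_pred = (linear(audio, 4, seed=5) + linear(video, 4, seed=6)) / 2

print(early_pred.shape, mid_pred.shape, late_pred.shape)
```

The only architectural difference between the three variants is where the modalities meet: at the input, after per-modality encoding, or after per-modality prediction.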