Proceedings of the Third International Conference on Computer Vision Theory and Applications 2008
DOI: 10.5220/0001082801450151

Low-Level Fusion of Audio and Video Feature for Multi-Modal Emotion Recognition

Abstract: Bimodal emotion recognition through audiovisual feature fusion has been shown to be superior to each individual modality in the past. Still, synchronization of the two streams is a challenge, as many vision approaches work on a frame basis, as opposed to the turn- or chunk-basis of audio. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We ther…

Cited by 13 publications (3 citation statements)
References 14 publications
“…In general, DNN-based multimodal models include multiple streams of networks for modalities [14]. The models often have a component for fusing the features [2], [6], [15] to make a prediction based on intermediate features from the streams. Hence, by considering the fusion stage in the model architecture, multimodal architectures can be categorized into early fusion [15], mid-fusion [6], [7], [14], [16], and late fusion [2], [17].…”
Section: Related Work, A. Multimodal Models
confidence: 99%
“…The early fusion approach integrates the features of different modalities as inputs and uses a unified feature for the downstream task [15]. The mid-fusion approach involves the concatenation of features that are encoded from the raw data of each modality into a single feature [6], [7], [14], [16].…”
Section: Related Work, A. Multimodal Models
confidence: 99%
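The early/mid/late fusion taxonomy described in these citation statements can be sketched with toy feature vectors. The following is a minimal NumPy illustration, not any of the cited models: the feature dimensions and the random linear maps standing in for learned networks are assumptions chosen only to make the fusion stages visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (e.g. audio and video).
audio = rng.normal(size=8)
video = rng.normal(size=12)

def linear(x, out_dim, seed):
    """Stand-in for a learned network: a single random linear map."""
    w = np.random.default_rng(seed).normal(size=(out_dim, x.shape[0]))
    return w @ x

# Early fusion: concatenate low-level features, then one shared model.
early_pred = linear(np.concatenate([audio, video]), 4, seed=1)

# Mid-fusion: encode each modality first, then concatenate the encodings
# into a single feature for the downstream predictor.
audio_enc = linear(audio, 6, seed=2)
video_enc = linear(video, 6, seed=3)
mid_pred = linear(np.concatenate([audio_enc, video_enc]), 4, seed=4)

# Late fusion: independent per-modality predictions combined at the
# decision level (here, a simple average; voting is another option).
late_pred = (linear(audio, 4, seed=5) + linear(video, 4, seed=6)) / 2

print(early_pred.shape, mid_pred.shape, late_pred.shape)
```

The only architectural difference between the three variants is where the modalities meet: at the input, after per-modality encoding, or after per-modality prediction.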