Exploring Fusion Strategies in Deep Learning Models for Multi-Modal Classification

Zhang, Duoyi; Nayak, Richi; Bashar, MA

doi:10.1007/978-981-16-8531-6_8

Cited by 9 publications

(3 citation statements)

References 17 publications


“…Experiments with more complex feature fusion models have also been carried out. The co-attention and cross-attention techniques proposed in Zhang et al [93] did not improve the results compared to the selected fusion method. Furthermore, we investigate whether using a long short-term memory (LSTM) network in the output of the text model, as suggested by Gallo et al [94], yields improved classification results; however, this did not happen.…”

Section: Multimodal Classification (mentioning)

confidence: 81%

Multimodal Fine-Grained Grocery Product Recognition Using Image and Ocr Text

Pettersson, Riveiro, Löfström. 2023. Preprint.


“…The resulting Q contains information from the static modality that is correlated with the time series modality. Unlike the aforementioned strategies, an attention-based mechanism can accurately model correlated parts between modalities [57]. Ideally, this enables the model to learn only the valuable information from the static modality.…”

Section: Attention-based Fusion (mentioning)

confidence: 99%

Joint Representation Learning with Generative Adversarial Imputation Network for Improved Classification of Longitudinal Data

Pingi, Zhang, Bashar et al. 2023. Data Sci. Eng.

Self Cite


Generative adversarial networks (GANs) have demonstrated their effectiveness in generating temporal data to fill in missing values, enhancing the classification performance of time series data. Longitudinal datasets encompass multivariate time series data with additional static features that contribute to sample variability over time. These datasets often encounter missing values due to factors such as irregular sampling. However, existing GAN-based imputation methods that address this type of data missingness often overlook the impact of static features on temporal observations and classification outcomes. This paper presents a novel method, fusion-aided imputer-classifier GAN (FaIC-GAN), tailored for longitudinal data classification. FaIC-GAN simultaneously leverages partially observed temporal data and static features to enhance imputation and classification learning. We present four multimodal fusion strategies that effectively extract correlated information from both static and temporal modalities. Our extensive experiments reveal that FaIC-GAN successfully exploits partially observed temporal data and static features, resulting in improved classification accuracy compared to unimodal models. Our post-additive and attention-based multimodal fusion approaches within the FaIC-GAN model consistently rank among the top three methods for classification.
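The attention-based fusion described in the citation statement above can be illustrated with a minimal sketch. It assumes generic scaled dot-product cross-attention, with time-series steps acting as queries and static-feature embeddings acting as keys and values; the citing paper's exact formulation is not given here, so the function name `cross_attention`, the shapes, and the absence of learned projection matrices are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def cross_attention(temporal, static_tokens):
    """Weight static information by its similarity to each time step.

    temporal:      (T, d) array, one embedding per time step (queries).
    static_tokens: (S, d) array, one embedding per static feature
                   (keys and values). Hypothetical layout for illustration.
    Returns the fused (T, d) array and the (T, S) attention weights.
    """
    d = temporal.shape[-1]
    # Similarity between every time step and every static token.
    scores = temporal @ static_tokens.T / np.sqrt(d)
    # Each time step distributes attention over the static tokens.
    weights = softmax(scores, axis=-1)
    # Static information, re-weighted per time step.
    fused = weights @ static_tokens
    return fused, weights


rng = np.random.default_rng(0)
temporal = rng.normal(size=(5, 8))       # 5 time steps, 8-dim embeddings
static_tokens = rng.normal(size=(3, 8))  # 3 static features, 8-dim embeddings
fused, weights = cross_attention(temporal, static_tokens)
```

In this sketch each row of `weights` sums to one, so every time step receives a convex combination of the static embeddings. A trained model would normally add learned query, key, and value projections before the dot product.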


“…Late fusion, on the other hand, processes each modality separately and fuses the resulting logits or decision scores. Various techniques, from simple averaging to attention-based methods, are used in existing works [3,22,24,32,37,38]. The fusion strategy significantly impacts the system's robustness and accuracy, especially when modalities provide conflicting cues.…”

Section: Multi-modal Features Fusion (mentioning)

confidence: 99%

Multi-View Transformer for 3D Visual Grounding

Huang, Chen, Jia et al. 2022. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).


3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries, particularly with descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, particularly excelling by a large margin in categories that require viewpoint-dependent descriptions. The source code and additional resources for this project are available on GitHub: https://github.com/dfki-av/MiKASA-3DVG
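The late-fusion idea in the citation statement for this entry (process each modality separately, then fuse the resulting logits or decision scores) can be sketched with the simplest option it names, averaging. The helper `late_fusion_average`, the optional `weights` argument, and the example logits are hypothetical and chosen only for illustration; they are not taken from any of the cited papers.

```python
import math


def softmax(logits):
    """Convert one modality's logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_average(per_modality_logits, weights=None):
    """Fuse per-modality predictions by (weighted) averaging of probabilities.

    per_modality_logits: list of logit lists, one per modality, all with
                         the same number of classes.
    weights:             optional per-modality weights that sum to 1;
                         defaults to a plain average.
    """
    n = len(per_modality_logits)
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(logits) for logits in per_modality_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


# Two modalities that disagree: the image model favours class 0,
# the text model favours class 1.
image_logits = [2.0, 0.5, -1.0]
text_logits = [0.1, 1.5, 0.0]

fused = late_fusion_average([image_logits, text_logits])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Here the two modalities give conflicting cues, and the plain average follows the more confident one. Passing non-uniform `weights` is one simple way to favour a modality that is known to be more reliable.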
