2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE) 2021
DOI: 10.1109/icbaie52039.2021.9389877
Visual Question Answering Combining Multi-modal Feature Fusion and Multi-Attention Mechanism

Cited by 4 publications (3 citation statements) | References 9 publications
“…To uniformly process these features of different dimensions, an autoencoder (AE) is designed for each modality to transform their dimensions to 512. Equations (5) and (6) show the process of encoding the original features. Here, F_text and F_image respectively represent textual features and global image features.…”
Section: A. Feature Alignment
confidence: 99%
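The statement above describes a per-modality autoencoder that maps textual and global image features into a shared 512-dimensional space (Equations (5) and (6) of the citing paper). A minimal PyTorch sketch of that idea follows; the input dimensions, layer sizes, and the reconstruction setup are illustrative assumptions, not values taken from either paper.

```python
# Hedged sketch: one autoencoder per modality, encoding each feature vector
# into a common 512-dim space. Input dimensions (768 for text, 2048 for image)
# and layer sizes are assumptions for illustration.
import torch
import torch.nn as nn

class ModalityAutoencoder(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Encoder maps the original feature to the shared 512-dim space.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # Decoder reconstructs the original feature for the AE training loss.
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)      # aligned 512-dim representation
        x_rec = self.decoder(z)  # reconstruction used for the AE objective
        return z, x_rec

# One autoencoder per modality, as the citation statement describes.
text_ae = ModalityAutoencoder(in_dim=768)    # e.g. text encoder output size (assumed)
image_ae = ModalityAutoencoder(in_dim=2048)  # e.g. CNN global image feature size (assumed)

f_text = torch.randn(4, 768)
f_image = torch.randn(4, 2048)
z_text, _ = text_ae(f_text)     # -> (4, 512)
z_image, _ = image_ae(f_image)  # -> (4, 512)
```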
“…(2) Difficulty in aligning features across different modalities. The multimodal feature fusion and multi-level attention mechanism introduced by Linqin et al. [6] in “Visual Question Answering Combining Multi-modal Feature Fusion and Multi-Attention Mechanism” enhances semantic information and accurately captures image features. However, the significant differences in feature representation and semantics between image and text features still pose a substantial challenge.…”
Section: Introduction
confidence: 99%
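The cited approach pairs multi-modal feature fusion with a multi-level attention mechanism. Below is a hedged sketch of one plausible building block, question-guided attention over image region features followed by element-wise fusion; the module name, dimensions, and fusion operator are assumptions for illustration and do not reproduce the paper's actual architecture.

```python
# Hedged sketch: question-guided attention over image regions, then a simple
# element-wise fusion of the question feature with the attended image feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttentionFusion(nn.Module):
    def __init__(self, q_dim: int = 512, v_dim: int = 512, hid: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid)
        self.v_proj = nn.Linear(v_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, q: torch.Tensor, v: torch.Tensor):
        # q: (B, q_dim) question feature; v: (B, R, v_dim) image region features
        joint = torch.tanh(self.q_proj(q).unsqueeze(1) + self.v_proj(v))  # (B, R, hid)
        alpha = F.softmax(self.score(joint), dim=1)                       # (B, R, 1) attention weights
        v_att = (alpha * v).sum(dim=1)                                    # (B, v_dim) attended image feature
        return q * v_att                                                  # element-wise fusion (assumed operator)

fusion = QuestionGuidedAttentionFusion()
q = torch.randn(4, 512)       # question features
v = torch.randn(4, 36, 512)   # 36 image region features per sample (assumed)
h = fusion(q, v)              # -> (4, 512) fused multimodal representation
```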
“…Multimodal feature fusion was pioneered in the field of visual question answering [18]. Multimodal fusion refers to obtaining information from text, image, voice, video, and other fields to realize information conversion and fusion, so as to enhance the model's ability to take in data. It is a typical interdisciplinary field.…”
Section: Multimodal Feature Fusions
confidence: 99%
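As a concrete illustration of the fusion idea in the statement above, the sketch below projects features from several modalities into a shared space and mixes them into one joint representation; the choice of modalities, dimensions, and the concatenate-then-project strategy are assumptions made for illustration only.

```python
# Hedged sketch of simple multimodal feature fusion: project each modality into
# a shared space, then concatenate and mix. All dimensions are illustrative.
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    def __init__(self, dims: dict, shared: int = 256):
        super().__init__()
        # One projection per modality (text, image, audio, ...).
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared) for m, d in dims.items()})
        self.mix = nn.Linear(shared * len(dims), shared)

    def forward(self, feats: dict) -> torch.Tensor:
        parts = [torch.relu(self.proj[m](x)) for m, x in feats.items()]
        return self.mix(torch.cat(parts, dim=-1))  # joint multimodal representation

fusion = SimpleMultimodalFusion({"text": 768, "image": 2048, "audio": 128})
joint = fusion({
    "text": torch.randn(2, 768),
    "image": torch.randn(2, 2048),
    "audio": torch.randn(2, 128),
})  # -> (2, 256)
```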