By integrating visual and natural language understanding, Visual Question Answering (VQA) holds promise for enhancing the intelligence of computer systems and thereby improving user work efficiency. However, current VQA research faces two major challenges: low precision in feature pre-extraction and low efficiency in feature fusion. This study proposes RSFE-TBT, a pre-trained framework for question-answering tasks that improves feature pre-extraction accuracy and feature fusion efficiency across modalities. To address the low pre-extraction accuracy of existing methods, ResNet50-SF is proposed for pre-extracting image features: because ResNet50 is limited in recognizing small objects and specific spatial positions, a bidirectional feature pyramid network (BiFPN) with spatial attention is introduced. To address the difficulty of aligning multi-modal features and the low efficiency of fusion, text block features are introduced as a bridge between image and text; these encompass block position, shape, sequence, and relative arrangement. Image features are divided into semantic and spatial aspects, while text block features are divided into positional and index attributes. Efficient multimodal fusion is then achieved with multi-layer Transformer Encoders. Experimental results on the MultiDoc-InfoExtract Dataset demonstrate the superior performance of this method on Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks, achieving F1, Precision, and Recall scores of 0.975, 0.975, and 0.975 on SER and 0.969, 0.953, and 0.986 on RE, with a single-image inference time of only 0.082 s. Ablation studies validate the contributions of the improved image feature extraction model, the inclusion of text block features, the feature refinement strategy, and the Transformer Encoder architecture to the performance of the question-answering system.
Additionally, comparative studies show that RSFE-TBT outperforms competing models in accuracy, speed, and model size.
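As a rough illustration of the fusion stage summarized above (a minimal sketch, not the authors' implementation), the snippet below concatenates hypothetical image tokens and text-block tokens into one sequence and passes it through stacked single-head self-attention encoder layers in NumPy. All dimensions, weight shapes, and the depth of the stack are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # assumed shared embedding width for both modalities


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def encoder_layer(tokens, wq, wk, wv, wo):
    """Single-head self-attention with a residual connection,
    the core operation of a Transformer Encoder layer."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d))          # token-to-token weights
    return tokens + attn @ v @ wo                 # residual update


# Hypothetical pre-extracted features, already projected to width d:
# 4 image tokens (semantic + spatial) and 3 text-block tokens
# (positional + index attributes).
img_tokens = rng.standard_normal((4, d))
txt_tokens = rng.standard_normal((3, d))

# Early fusion: one shared sequence lets attention mix both modalities.
fused = np.concatenate([img_tokens, txt_tokens], axis=0)
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
for _ in range(2):  # "multi-layer" stacking; depth 2 is an assumption
    fused = encoder_layer(fused, *weights)

print(fused.shape)  # (7, 16): one fused representation per input token
```

Because every token attends to every other token in the shared sequence, image regions and text blocks exchange information directly, which is the intuition behind Transformer-based multimodal fusion.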