Local self-attention in transformer for visual question answering

Shen, Xiang; Han, Dezhi; Guo, Zihan; Chen, Chongqing; Hua, Jie; Luo, Gaofeng

doi:10.1007/s10489-022-04355-w

Cited by 26 publications

(16 citation statements)

References 45 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In this paper, an intrusion detection model (RESNETCCN) is proposed that fuses traffic detection requirements. In our future work, we will introduce more new ideas such as blockchain cryptography [8], [18], [9], [19], [16], alliance chain [36], [7], [20], visual Q&A [5], [28], transformer [21], panoramic image [17], reinforcement learning [3], internet of things [23], [24], shared data [6] in our model.We will continue to explore network intrusion detection methods in more areas such as unsupervised and semi-supervised [2] areas for network anomalous traffic data detection. In addition, we also try to introduce new evaluation metrics and establish systematic evaluation methods of intrusion detection.…”

Section: Discussionmentioning

confidence: 99%

RESNETCNN: An abnormal network traffic flows detection model

Han

Cui

et al. 2023

ComSIS

View full text Add to dashboard Cite

Intrusion detection is an important means to protect system security by detecting intrusions or intrusion attempts on the system through operational behaviors, security logs, and data audit. However, existing intrusion detection systems suffer from incomplete data feature extraction and low classification accuracy, which affects the intrusion detection effect. To this end, this paper proposes an intrusion detection model that fuses residual network(RESNET) and parallel cross-convolutional neural network, called RESNETCCN. RESNETCNN can efficiently learn various data stream features through the fusion of deep learning and convolutional neural network (CNN), which improves the detection accuracy of abnormal data streams in unbalanced data streams, moreover, the oversampling method into the data preprocessing, to extract multiple types of unbalanced data stream features at the same time, effectively solving the problems of incomplete data feature extraction and low classification accuracy of unbalanced data streams. Finally, three improved versions of RESNETCNN networks are designed to meet the requirements of different traffic data processing, and the highest detection accuracy reaches 99.98% on the CICIDS 2017 dataset and 99.90% on the ISCXIDS 2012 dataset.

show abstract

Section: Discussionmentioning

confidence: 99%

RESNETCNN: An abnormal network traffic flows detection model

Han

Cui

et al. 2023

ComSIS

View full text Add to dashboard Cite

show abstract

“…As shown in Table 4 , we compare the MAGM model with the current SOTA model, and the last row of Table 4 is the test result of the MAGM model proposed in this paper. The bilinear attention network BAN [ 20 ] considers the bilinear interaction between multimodal inputs to utilize the question feature and image feature information fully; BAN-Counter [ 20 ] combines BAN with Counter [ 20 ], which is a neural network component, can further improve the accuracy of the model on Number-type problems through the robust counting function; Bottom-up [ 59 ] and Bottom-up+MFH [ 25 ] combine regional visual features with question-guided visual attention; LSAT-R [ 28 ] model considers local self-attention, which can effectively avoid redundant information in global self-attention (“-R” indicates that the LSAT model is trained on the VQA2.0 dataset using the same region image features as the MAGM model and other SOTA models for comparison); Unified VLP [ 46 ] is a bidirectional and seq2seq-based unified visual-language pre-training model that can be fine-tuned for visual-language generation and understanding tasks. The pre-training models ViLBERT [ 43 ] and VisualBERT [ 47 ] use the BERT architecture, where VisualBERT is a single-stream model, and ViLBERT is a two-stream model.…”

Section: Methodsmentioning

confidence: 99%

Multi-modal adaptive gated mechanism for visual question answering

Zhang

Shen

2023

PLoS ONE

Self Cite

View full text Add to dashboard Cite

Visual Question Answering (VQA) is a multimodal task that uses natural language to ask and answer questions based on image content. For multimodal tasks, obtaining accurate modality feature information is crucial. The existing researches on the visual question answering model mainly start from the perspective of attention mechanism and multimodal fusion, which will tend to ignore the impact of modal interaction learning and the introduction of noise information in the process of modal fusion on the overall performance of the model. This paper proposes a novel and efficient multimodal adaptive gated mechanism model, MAGM. The model adds an adaptive gate mechanism to the intra- and inter-modality learning and the modal fusion process. This model can effectively filter irrelevant noise information, obtain fine-grained modal features, and improve the ability of the model to adaptively control the contribution of the two modal features to the predicted answer. In intra- and inter-modality learning modules, the self-attention gated and self-guided-attention gated units are designed to filter text and image features’ noise information effectively. In modal fusion module, the adaptive gated modal feature fusion structure is designed to obtain fine-grained modal features and improve the accuracy of the model in answering questions. Quantitative and qualitative experiments on the two VQA task benchmark datasets, VQA 2.0 and GQA, proved that the method in this paper is superior to the existing methods. The MAGM model has an overall accuracy of 71.30% on the VQA 2.0 dataset and an overall accuracy of 57.57% on the GQA dataset.

show abstract

“…The second type is content-based sparse attention [18,[41][42][43], which dynamically computes attention weights for input data, adaptively allocating attention to prioritize essential information while minimizing processing of irrelevant details, thereby demonstrating enhanced flexibility and adaptability. Furthermore, the local attention mechanism [16,44] is also regarded as a specialized form of sparse attention, predominantly utilizing a window mechanism to achieve localized and sparse focus within the data. In the VQA tasks, researchers often employ sparse attention mechanisms [29,45,46] to select crucial question or image features.…”

Section: Sparse Attentionmentioning

confidence: 99%

GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

Shen,

Han,

Chang

et al. 2024

Preprint

Self Cite

View full text Add to dashboard Cite

A profound understanding and reasoning of the relationship between images and question are crucial in Visual Question Answering (VQA) tasks. However, traditional self-attention mechanisms exhibit limitations , primarily confined to spatial domain modeling of images, lacking 20 the capability to adequately model and analyze visual information at different scales in the frequency domain. Additionally, the traditional self-attention-based image feature modeling introduces noise when capturing long-distance dependencies, causing the model to overly focus on irrelevant details, thereby reducing robustness. To address these issues, 25 this paper proposes a novel Gaussian Fourier with Sparse Attention Network (GFSNet). GFSNet utilizes Fourier transform techniques to represent image attention weights obtained through self-attention in the frequency domain, facilitating the effective modeling of different scale information by analyzing attention weights in the frequency domain. 30 Recognizing that different scale information in images often manifests as distinct frequency components, the model can better capture and 1 Springer Nature 2021 L A T E X template Gaussian Fourier with Sparse Attention Network for VQA adapt to the complex structures and correlations of these various scale details. To mitigate high-frequency noise in the frequency domain, we design an adaptive Gaussian filter to effectively suppress or filter noise in 35 the images. Finally, a novel sparse attention mechanism is introduced to select optimized key frequency domain features. This enables the model to more effectively focus on critical image regions, reducing the processing of irrelevant or redundant information, while enhancing interpretability and robustness. The proposed GFSNet model aims to achieve effective 40 modeling of visual information at different scales without increasing model parameters or altering computational complexity. Extensive experiments on the VQAv2 and GQA benchmark datasets unequivocally demonstrate the superiority and effectiveness of the GFSNet approach. Source code is available at https://github.com/shenxiang-vqa/GFSNet.

show abstract

Local self-attention in transformer for visual question answering

Cited by 26 publications

References 45 publications

RESNETCNN: An abnormal network traffic flows detection model

RESNETCNN: An abnormal network traffic flows detection model

Multi-modal adaptive gated mechanism for visual question answering

GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

Contact Info

Product

Resources

About