SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Xiong, Peixi; You, Quanzeng; Yu, Pei; Liu, Zicheng; Wu, Ying

doi:10.48550/arxiv.2201.10654

Cited by 2 publications

(3 citation statements)

References 30 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…When the value exceeds t i , we record the position (i, j) in the position matrix P . The entire process is mathematically described in equations ( 11) and (12). By adopting this approach, we can effectively concentrate on critical information while also considering the positional context of the selected elements, leading to an improved feature representation and a deeper understanding of the underlying patterns in the image.…”

Section: Sparse Attentionmentioning

confidence: 99%

“…Other research endeavors concentrate on modeling and understanding image information to achieve a profound comprehension of image content and address intricate reasoning questions related to the queries. In the realm of semantic relationship modeling in images, researchers primarily employ graph neural network methods [10][11][12][13] for explicit and implicit reasoning. The objective of these methods is to capture semantic relationships within images, thereby enabling the model to better understand the connection between image content and questions.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

Shen,

Han,

Chang

et al. 2024

Preprint

View full text Add to dashboard Cite

A profound understanding and reasoning of the relationship between images and question are crucial in Visual Question Answering (VQA) tasks. However, traditional self-attention mechanisms exhibit limitations , primarily confined to spatial domain modeling of images, lacking 20 the capability to adequately model and analyze visual information at different scales in the frequency domain. Additionally, the traditional self-attention-based image feature modeling introduces noise when capturing long-distance dependencies, causing the model to overly focus on irrelevant details, thereby reducing robustness. To address these issues, 25 this paper proposes a novel Gaussian Fourier with Sparse Attention Network (GFSNet). GFSNet utilizes Fourier transform techniques to represent image attention weights obtained through self-attention in the frequency domain, facilitating the effective modeling of different scale information by analyzing attention weights in the frequency domain. 30 Recognizing that different scale information in images often manifests as distinct frequency components, the model can better capture and 1 Springer Nature 2021 L A T E X template Gaussian Fourier with Sparse Attention Network for VQA adapt to the complex structures and correlations of these various scale details. To mitigate high-frequency noise in the frequency domain, we design an adaptive Gaussian filter to effectively suppress or filter noise in 35 the images. Finally, a novel sparse attention mechanism is introduced to select optimized key frequency domain features. This enables the model to more effectively focus on critical image regions, reducing the processing of irrelevant or redundant information, while enhancing interpretability and robustness. The proposed GFSNet model aims to achieve effective 40 modeling of visual information at different scales without increasing model parameters or altering computational complexity. Extensive experiments on the VQAv2 and GQA benchmark datasets unequivocally demonstrate the superiority and effectiveness of the GFSNet approach. Source code is available at https://github.com/shenxiang-vqa/GFSNet.

show abstract

Section: Sparse Attentionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

Shen,

Han,

Chang

et al. 2024

Preprint

View full text Add to dashboard Cite

show abstract

“…To introduce more explicit and explainable information to represent the complex relationships, the visual scene graph [10] that contains more edge semantics than the classic Laplacian graph is introduced into the VQA task. Visual scene graph used in VQA methods is mainly constructed according to three types of explicable structural information: relative spatial information [11,28,29], visual relationship information [30][31][32] and KG information [18,33,34].…”

Section: Attention Based Vqa Workmentioning

confidence: 99%

DSGEM: Dual scene graph enhancement module‐based visual question answering

Wang

et al. 2023

IET Computer Vision

View full text Add to dashboard Cite

Visual Question Answering (VQA) aims to appropriately answer a text question by understanding the image content. Attention‐based VQA models mine the implicit relationships between objects according to the feature similarity, which neglects the explicit relationships between objects, for example, the relative position. Most Visual Scene Graph‐based VQA models exploit the relative positions or visual relationships between objects to construct the visual scene graph, while they suffer from the semantic insufficiency of visual edge relations. Besides, the scene graph of text modality is often ignored in these works. In this article, a novel Dual Scene Graph Enhancement Module (DSGEM) is proposed that exploits the relevant external knowledge to simultaneously construct two interpretable scene graph structures of image and text modalities, which makes the reasoning process more logical and precise. Specifically, the authors respectively build the visual and textual scene graphs with the help of commonsense knowledge and syntactic structure, which explicitly endows the specific semantics to each edge relation. Then, two scene graph enhancement modules are proposed to propagate the involved external and structural knowledge to explicitly guide the feature interaction between objects (nodes). Finally, the authors embed such two scene graph enhancement modules to existing VQA models to introduce the explicit relation reasoning ability. Experimental results on both VQA V2 and OK‐VQA datasets show that the proposed DSGEM is effective and compatible to various VQA architectures.

show abstract

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Cited by 2 publications

References 30 publications

GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

GFSNet: Gaussian Fourier with Sparse Attention Network for Visual Question Answering

DSGEM: Dual scene graph enhancement module‐based visual question answering

Contact Info

Product

Resources

About