2022
DOI: 10.48550/arxiv.2201.10654
Preprint

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Abstract: Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between visual regions and their semantic labels, or the interactions across question words and object features. These attempts aim to improve the cross-modality repre…
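For concreteness, the entity-level alignment the abstract contrasts with is commonly realized as cross-attention between question-word embeddings and object-region features. The sketch below illustrates that general pattern under assumed names and shapes (entity_level_alignment, d = 512, 8 tokens, 36 regions); it is not this paper's actual method.

```python
import torch
import torch.nn.functional as F

def entity_level_alignment(words, regions):
    """words: (n_words, d) question-token embeddings; regions: (n_regions, d)
    object features. Returns one attended region summary per word."""
    scores = words @ regions.T / words.shape[-1] ** 0.5  # (n_words, n_regions)
    attn = F.softmax(scores, dim=-1)                     # word-to-region alignment
    return attn @ regions                                # (n_words, d)

words = torch.randn(8, 512)     # e.g. 8 question tokens (assumed)
regions = torch.randn(36, 512)  # e.g. 36 detected object regions (assumed)
aligned = entity_level_alignment(words, regions)
```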

Cited by 2 publications (3 citation statements)
References 30 publications
“…When the value exceeds t_i, we record the position (i, j) in the position matrix P. The entire process is mathematically described in Equations (11) and (12). By adopting this approach, we can effectively concentrate on critical information while also considering the positional context of the selected elements, leading to an improved feature representation and a deeper understanding of the underlying patterns in the image.…”
Section: Sparse Attention (mentioning)
confidence: 99%
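The quote only gestures at the mechanism, so here is a minimal sketch of threshold-based position selection as described: record (i, j) in a position matrix P whenever an attention value exceeds a per-row threshold t_i. The function name, tensor shapes, and the choice of the row mean as t_i are assumptions, not the citing paper's actual Equations (11) and (12).

```python
import torch

def sparse_positions(scores, t):
    """scores: (n, n) attention values; t: (n,) per-row thresholds t_i.
    Returns P, a boolean matrix marking positions (i, j) where scores[i, j] > t_i."""
    P = scores > t.unsqueeze(-1)   # record (i, j) whenever the value exceeds t_i
    return P

scores = torch.rand(6, 6)
t = scores.mean(dim=-1)            # assumed: per-row mean as the threshold t_i
P = sparse_positions(scores, t)
sparse_scores = scores * P.float() # keep only the selected critical entries
```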
“…Other research endeavors concentrate on modeling and understanding image information to achieve a profound comprehension of image content and address intricate reasoning questions related to the queries. In the realm of semantic relationship modeling in images, researchers primarily employ graph neural network methods [10][11][12][13] for explicit and implicit reasoning. The objective of these methods is to capture semantic relationships within images, thereby enabling the model to better understand the connection between image content and questions.…”
Section: Introduction (mentioning)
confidence: 99%
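As a rough illustration of the graph-based reasoning these methods share, the sketch below runs one round of message passing over an assumed image-region graph; the adjacency construction, shapes, and update rule are placeholders rather than any specific cited model.

```python
import torch

def gnn_step(node_feats, adj, weight):
    """One round of neighborhood aggregation over an image-region graph.
    node_feats: (n, d) region features; adj: (n, n) adjacency; weight: (d, d)."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # avoid division by zero
    msgs = (adj @ node_feats) / deg                   # mean message from neighbors
    return torch.relu((node_feats + msgs) @ weight)   # update each region's state

n, d = 36, 512
feats = torch.randn(n, d)                 # detected region features (assumed)
adj = (torch.rand(n, n) > 0.8).float()    # placeholder relation graph
w = torch.randn(d, d) * (d ** -0.5)
feats = gnn_step(feats, adj, w)           # regions now encode relational context
```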
“…To introduce more explicit and explainable information for representing complex relationships, the visual scene graph [10], which contains richer edge semantics than the classic Laplacian graph, is introduced into the VQA task. The visual scene graph used in VQA methods is mainly constructed from three types of explicable structural information: relative spatial information [11,28,29], visual relationship information [30,31,32], and KG information [18,33,34].…”
Section: Attention-Based VQA Work (mentioning)
confidence: 99%
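Of the three construction signals listed, relative spatial information is the simplest to illustrate. Below is a toy sketch that labels scene-graph edges with spatial relations derived from bounding-box centers; the relation vocabulary and the (x1, y1, x2, y2) box format are assumptions, not taken from the cited constructions.

```python
def spatial_relation(box_a, box_b):
    """Label the edge from box_a to box_b with a coarse spatial relation,
    picking the dominant axis of center displacement."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    horiz = "left of" if ax < bx else "right of"
    vert = "above" if ay < by else "below"
    return horiz if abs(ax - bx) > abs(ay - by) else vert

boxes = {"dog": (10, 60, 80, 120), "ball": (90, 100, 120, 130)}
edges = [(a, spatial_relation(ba, bb), b)
         for a, ba in boxes.items() for b, bb in boxes.items() if a != b]
# e.g. [('dog', 'left of', 'ball'), ('ball', 'right of', 'dog')]
```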