2022
DOI: 10.3390/math10183244

Path-Wise Attention Memory Network for Visual Question Answering

Abstract: Visual question answering (VQA) is regarded as a multi-modal fine-grained feature fusion task, which requires the construction of multi-level and omnidirectional relations between nodes. One main solution is the composite attention model, which is composed of co-attention (CA) and self-attention (SA). However, existing composite models only consider stacks of single attention blocks and lack path-wise historical memory and overall adjustment. We propose a path attention memory network (PAM) to construct…
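The composite attention design the abstract refers to (stacks of self-attention and co-attention blocks over image and question features) can be illustrated with a minimal NumPy sketch. This is a generic SA + CA block in the style of such models, not the authors' PAM; the function names, feature dimensions, and the three-block stack are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention: each query output is a weighted
    # sum of values, weighted by query-key relevance.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) relevance scores
    return softmax(scores) @ values

def composite_block(img_feats, txt_feats):
    # One composite block: self-attention (SA) within each modality,
    # then co-attention (CA) in which the question guides the image.
    txt = attention(txt_feats, txt_feats, txt_feats)  # SA over words
    img = attention(img_feats, img_feats, img_feats)  # SA over regions
    img = attention(img, txt, txt)                    # CA: image attends to text
    return img, txt

# Toy inputs: 36 image regions and 14 question tokens, 512-d each.
rng = np.random.default_rng(0)
img = rng.normal(size=(36, 512))
txt = rng.normal(size=(14, 512))
for _ in range(3):            # a plain stack of single attention blocks
    img, txt = composite_block(img, txt)
print(img.shape, txt.shape)   # (36, 512) (14, 512)
```

Real composite models add learned projections, multiple heads, residual connections, and layer normalization; the sketch keeps only the block composition, which is exactly the "stack of single attention blocks" structure the abstract argues lacks path-wise memory.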

Cited by 1 publication (1 citation statement) · References 63 publications (110 reference statements)
“…That is, the observation can be adjusted toward the more informative features according to their relative importance, focusing the algorithm on the most relevant parts of the input and moving from global features to focused features, thus saving resources and extracting the most effective information quickly. The attention mechanism has arguably become one of the most important concepts in deep learning. Since Bahdanau, Cho & Bengio (2015) used the attention mechanism for machine translation, various variants have emerged, such as Co-Attention networks (Yang et al., 2019a; Han et al., 2021; Yu et al., 2019; Liu et al., 2021b; Lu et al., 2016; Sharma & Srivastava, 2022), Recurrent Attention networks (Osman & Samek, 2019; Ren & Zemel, 2017; Gan et al., 2019), and Self-Attention networks (Li et al., 2019; Fan et al., 2019; Ramachandran et al., 2019; Xia et al., 2022; Xiang et al., 2022; Yan, Silamu & Li, 2022). All of these attention mechanisms considerably enhance the effectiveness of visual information processing and in turn improve VQA performance.…”
Section: Introduction (mentioning)
confidence: 99%
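The re-weighting behavior this excerpt describes (shifting from global features to the most relevant parts according to relative importance) is the additive attention of Bahdanau, Cho & Bengio (2015), which it cites. A minimal sketch follows; the parameter names and toy dimensions are illustrative assumptions, not taken from any cited implementation.

```python
import numpy as np

def additive_attention(query, keys, W_q, W_k, v):
    # Additive (Bahdanau-style) attention: score each key against the
    # query with a one-layer feed-forward net, softmax the scores into
    # relative-importance weights, and return the focused context.
    scores = np.tanh(query @ W_q + keys @ W_k) @ v  # (n_keys,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                    # weights sum to 1
    return w @ keys, w

# Toy setup: one 8-d query attends over five 8-d input features.
rng = np.random.default_rng(1)
d = 8
query = rng.normal(size=d)
keys = rng.normal(size=(5, d))
W_q, W_k, v = (rng.normal(size=(d, d)), rng.normal(size=(d, d)),
               rng.normal(size=d))
context, w = additive_attention(query, keys, W_q, W_k, v)
print(np.round(w, 3))  # most weight lands on the most relevant inputs
```

The printed weights make the "relative importance" point concrete: the model does not discard any input, but the softmax concentrates mass on the few entries whose scores are highest.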