2021
DOI: 10.1016/j.patrec.2021.06.026

Multimodal grid features and cell pointers for scene text visual question answering


Cited by 17 publications (4 citation statements)
References 3 publications
“…Han et al. 26 present LaAP‐Net, a model that not only predicts answers to questions but also generates bounding boxes as evidence to support the predicted answers. Gomez et al. 27 developed an STVQA model that uses an attention mechanism to jointly reason over the textual and visual modalities; by attending to multimodal grid features, their model achieves a comprehensive understanding of the scene, improving performance on the task. Liu et al. 28 proposed the Cascade Reasoning Network (CRN), which combines a progressive attention module (PAM) with a multimodal reasoning graph (MRG) module. The PAM performs stepwise encoding to fuse multimodal information, leveraging previous attention results to guide subsequent fusion steps.…”
Section: Related Work
confidence: 99%
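The question-conditioned attention over multimodal grid features described in the statement above can be illustrated with a minimal sketch. This is not the published architecture: the feature dimensions, the additive question conditioning, and the single-step softmax attention are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalGridAttention(nn.Module):
    """Question-conditioned attention over image grid features and OCR token
    features. Dimensions and the additive fusion are assumptions, not the
    published design."""

    def __init__(self, q_dim=768, grid_dim=2048, ocr_dim=300, hidden=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.grid_proj = nn.Linear(grid_dim, hidden)
        self.ocr_proj = nn.Linear(ocr_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, question, grid, ocr):
        # question: (B, q_dim); grid: (B, G, grid_dim); ocr: (B, T, ocr_dim)
        feats = torch.cat([self.grid_proj(grid), self.ocr_proj(ocr)], dim=1)  # (B, G+T, H)
        q = self.q_proj(question).unsqueeze(1)                                # (B, 1, H)
        attn = F.softmax(self.score(torch.tanh(feats + q)), dim=1)            # (B, G+T, 1)
        return (attn * feats).sum(dim=1)                                      # fused context (B, H)
```

A progressive variant in the spirit of the PAM could apply such a block repeatedly, feeding each step's fused context back into the query for the next fusion step.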
“…For the M+N targets, the above method is used to obtain the corresponding locational relation manifestation X_loc ∈ ℝ^{(M+N)×(M+N)×4}. Furthermore, X_loc is passed through a two-layer fully connected network to obtain the final locational relation knowledge manifestation R_loc ∈ ℝ^{(M+N)×(M+N)}, as shown in (3).…”
Section: Locational Collaboration Knowledge Manifestation With Inter-...
confidence: 99%
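As a rough illustration of the computation quoted above, the sketch below builds a pairwise location tensor X_loc of shape (M+N)×(M+N)×4 from bounding boxes and maps it through a two-layer fully connected network to R_loc of shape (M+N)×(M+N). The relative-offset encoding and the hidden size are assumptions, since the exact formula referenced as (3) is not quoted.

```python
import torch
import torch.nn as nn

def pairwise_location_features(boxes):
    """boxes: (M+N, 4) normalized [x, y, w, h].
    Returns X_loc: (M+N, M+N, 4) of relative offsets and log size ratios
    (an assumed encoding; the quoted text does not give the exact formula)."""
    xy, wh = boxes[:, :2], boxes[:, 2:]
    dxy = (xy.unsqueeze(1) - xy.unsqueeze(0)) / wh.unsqueeze(1).clamp(min=1e-6)
    dwh = torch.log(wh.unsqueeze(1).clamp(min=1e-6) / wh.unsqueeze(0).clamp(min=1e-6))
    return torch.cat([dxy, dwh], dim=-1)

class LocationalRelation(nn.Module):
    """Two-layer fully connected network mapping X_loc to R_loc."""

    def __init__(self, hidden=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, boxes):
        x_loc = pairwise_location_features(boxes)   # (M+N, M+N, 4)
        return self.fc(x_loc).squeeze(-1)           # R_loc: (M+N, M+N)
```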
“…Literature [2] and Literature [3] proposed integrating textual content into VQA, forming the tasks of TextVQA (Text Visual Question Answering) and ST-VQA (Scene Text Visual Question Answering), respectively, along with the construction of benchmark datasets. Fig 2 illustrates an example of the ST-VQA task, where the questions are related to the scene text in the image, requiring the model to establish unified collaborations between the question, visual targets, and scene text to generate correct answers.…”
Section: Introduction
confidence: 99%
“…Secondly, we decided to use a commercial OCR engine, specifically Amazon Textract 3, over Tesseract. This is because the performance of the OCR engine can significantly affect the model's performance, as can be seen in fields that use OCR annotations, such as fine-grained classification [29,30,31], scene-text visual question answering [9,44,8,13], and document visual question answering (DocVQA) [50,33]. Apart from significantly improving the annotation quality, we want to level the differences between research groups and companies.…”
Section: Introduction
confidence: 99%
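As a hedged example of the kind of OCR annotation step this statement refers to, the snippet below extracts word-level text and bounding boxes with Amazon Textract via boto3. The file name, region, and the decision to keep only WORD blocks are illustrative assumptions, not the authors' actual pipeline.

```python
import boto3

# Hypothetical example: extract word-level OCR tokens and boxes from one image.
client = boto3.client("textract", region_name="us-east-1")   # region is an assumption

with open("scene_image.jpg", "rb") as f:                      # illustrative file name
    response = client.detect_document_text(Document={"Bytes": f.read()})

# Keep word-level blocks: text, normalized bounding box, and OCR confidence.
words = [
    (block["Text"], block["Geometry"]["BoundingBox"], block["Confidence"])
    for block in response["Blocks"]
    if block["BlockType"] == "WORD"
]
```

A Tesseract-based alternative would follow the same pattern via pytesseract.image_to_data, which is the kind of engine swap the quoted comparison describes.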