2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01605

LaTr: Layout-Aware Transformer for Scene-Text VQA

Cited by 43 publications (14 citation statements)
References 47 publications
“…In Table 6, our method is compared with existing methods on the ST‐VQA dataset. Our proposed method surpasses Two‐way Co‐att (Sharma et al. 2022) and LaTr 37 by 3.02% and 4.57% accuracy on the validation set. In terms of the ANLS score, our method surpasses Two‐way Co‐att (Sharma et al. 2022), LaTr 37 and LOGOS 33 by 0.001, 0.024 and 0.118 on the validation set.…”
Section: Performance Analysis
Citation type: mentioning (confidence: 76%)
“…The model further incorporates a transformer model and dynamic pointer networks for answer decoding. LaTr 37 proposes a layout‐aware multimodal pre‐training task based on T5 with an extensive Industrial Document Library. Sharma and Srivastava 38 introduced enhancements to textual representations by incorporating FastText embeddings, size and color features, location information, and character‐level information.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
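The layout-aware pre-training idea summarized in the statement above can be illustrated with a short sketch: learned embeddings of each OCR token's quantized bounding-box coordinates are added to the token embeddings before they enter a T5-style encoder. This is a minimal reconstruction of the general technique, not LaTr's actual code; the class name, vocabulary size, hidden size, and coordinate binning below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Sketch: token embeddings plus learned 2D layout (bounding-box) embeddings.

    Dimensions and binning are placeholders, not the values used by LaTr.
    """

    def __init__(self, vocab_size: int = 32128, hidden: int = 512, num_bins: int = 1000):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        # Separate lookup tables for quantized x / y coordinates and box size.
        self.x_emb = nn.Embedding(num_bins, hidden)
        self.y_emb = nn.Embedding(num_bins, hidden)
        self.w_emb = nn.Embedding(num_bins, hidden)
        self.h_emb = nn.Embedding(num_bins, hidden)
        self.num_bins = num_bins

    def forward(self, token_ids: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) OCR token ids; boxes: (B, T, 4) normalized (x0, y0, x1, y1) in [0, 1].
        q = (boxes * (self.num_bins - 1)).long().clamp(0, self.num_bins - 1)
        x0, y0, x1, y1 = q.unbind(dim=-1)
        w = (x1 - x0).clamp(min=0)
        h = (y1 - y0).clamp(min=0)
        layout = (self.x_emb(x0) + self.x_emb(x1)
                  + self.y_emb(y0) + self.y_emb(y1)
                  + self.w_emb(w) + self.h_emb(h))
        # The summed embedding replaces plain token embeddings at the encoder input.
        return self.tok(token_ids) + layout
```

Under this reading, layout-aware pre-training on a large document corpus (such as the Industrial Document Library mentioned above) simply trains a denoising objective on top of these spatially enriched text embeddings; the sketch only shows where the spatial signal enters the model.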
“…For instance, to deal with images of different resolutions, Raisi et al [26] develop a transformer-based architecture for recognizing text in images that uses a 2D positional encoder so that the spatial information of the features is preserved. Biten et al [27] propose a layout-aware transformer with a pre-training scheme based only on text and spatial cues, and show that pre-training on scanned documents helps handle multimodality in scene-text visual question answering. Based on ViT [4], Tan et al [28] propose a mixture of pure-transformer experts for processing different resolutions in scene text recognition.…”
Section: B. Vision Transformers for STR
Citation type: mentioning (confidence: 99%)
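As a rough illustration of what a 2D positional encoder contributes, the sketch below builds a fixed sinusoidal encoding in which half the channels encode the row index and half the column index of a feature map, so spatial structure survives flattening into a token sequence. This is a generic formulation, not the exact encoder of Raisi et al [26]; the function name and sizes are placeholders.

```python
import math
import torch

def sinusoidal_pe_2d(height: int, width: int, dim: int) -> torch.Tensor:
    """Generic 2D sinusoidal positional encoding of shape (height, width, dim).

    The first dim/2 channels encode the row (y) position, the last dim/2 the
    column (x) position.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    half = dim // 2
    freq = torch.exp(torch.arange(0, half, 2, dtype=torch.float32) * (-math.log(10000.0) / half))
    y = torch.arange(height, dtype=torch.float32)[:, None] * freq[None, :]  # (H, half/2)
    x = torch.arange(width, dtype=torch.float32)[:, None] * freq[None, :]   # (W, half/2)
    pe_y = torch.cat([torch.sin(y), torch.cos(y)], dim=-1)                  # (H, half)
    pe_x = torch.cat([torch.sin(x), torch.cos(x)], dim=-1)                  # (W, half)
    pe = torch.cat([
        pe_y[:, None, :].expand(height, width, half),   # row code broadcast over columns
        pe_x[None, :, :].expand(height, width, half),   # column code broadcast over rows
    ], dim=-1)
    return pe  # added to (or concatenated with) the flattened visual features

# Example: encode a 16x48 feature map with 256 channels.
pe = sinusoidal_pe_2d(16, 48, 256)
```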
“…More Applications. Besides VLP for standard VL tasks, VLP has also been applied to tackle (i) TextVQA (Singh et al, 2019) and TextCaps (Sidorov et al, 2020) tasks that require an AI system to comprehend scene text in order to perform VQA and captioning, such as TAP (Yang et al, 2021d) and LaTr (Biten et al, 2022); (ii) visual dialog (Das et al, 2017) that requires an AI system to chat about an input image, such as VisDial-BERT (Murahari et al, 2020) and VD-BERT (Wang et al, 2020b); (iii) fashion-domain tasks, such as Kaleido-BERT (Zhuge et al, 2021) and Fashion-VLP (Goenka et al, 2022); and (iv) vision-language navigation (VLN), such as PREVALENT (Hao et al, 2020) and VLN-BERT (Hong et al, 2021), to name a few. A detailed literature review on VLN can be found in Gu et al (2022b).…”
Section: VLP for Big Models
Citation type: mentioning (confidence: 99%)