2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01605

LaTr: Layout-Aware Transformer for Scene-Text VQA

Cited by 43 publications (14 citation statements)
References 47 publications
“…In Table 6, our method is compared with existing methods on the ST‐VQA dataset. Our proposed method surpasses Two‐way Co‐att (Sharma et al. 2022) and LaTr 37 by 3.02% and 4.57% accuracy on the validation set. In terms of the ANLS score, our method surpasses Two‐way Co‐att (Sharma et al. 2022), LaTr 37 and LOGOS 33 by 0.001, 0.024 and 0.118 on the validation set.…”
Section: Performance Analysis
Citation type: mentioning (confidence: 76%)
“…The model further incorporates a transformer model and dynamic pointer networks for answer decoding. LaTr 37 proposes a layout‐aware multimodal pre‐training task based on T5 with an extensive Industrial Document Library. Sharma and Srivastava 38 introduced enhancements to textual representations by incorporating FastText embeddings, size and color features, location information, and character‐level information.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
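The layout-aware pre-training idea summarized in the statement above can be illustrated with a short sketch: learned embeddings of each OCR token's quantized bounding-box coordinates are added to the token embeddings before they enter a T5-style encoder. This is a minimal reconstruction of the general technique, not LaTr's actual code; the class name, vocabulary size, hidden size, and coordinate binning below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Sketch: token embeddings plus learned 2D layout (bounding-box) embeddings.

    Dimensions and binning are placeholders, not the values used by LaTr.
    """

    def __init__(self, vocab_size: int = 32128, hidden: int = 512, num_bins: int = 1000):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        # Separate lookup tables for quantized x / y coordinates and box size.
        self.x_emb = nn.Embedding(num_bins, hidden)
        self.y_emb = nn.Embedding(num_bins, hidden)
        self.w_emb = nn.Embedding(num_bins, hidden)
        self.h_emb = nn.Embedding(num_bins, hidden)
        self.num_bins = num_bins

    def forward(self, token_ids: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) OCR token ids; boxes: (B, T, 4) normalized (x0, y0, x1, y1) in [0, 1].
        q = (boxes * (self.num_bins - 1)).long().clamp(0, self.num_bins - 1)
        x0, y0, x1, y1 = q.unbind(dim=-1)
        w = (x1 - x0).clamp(min=0)
        h = (y1 - y0).clamp(min=0)
        layout = (self.x_emb(x0) + self.x_emb(x1)
                  + self.y_emb(y0) + self.y_emb(y1)
                  + self.w_emb(w) + self.h_emb(h))
        # The summed embedding replaces plain token embeddings at the encoder input.
        return self.tok(token_ids) + layout
```

Under this reading, layout-aware pre-training on a large document corpus (such as the Industrial Document Library mentioned above) simply trains a denoising objective on top of these spatially enriched text embeddings; the sketch only shows where the spatial signal enters the model.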
“…For instance, to deal with images of different resolutions, Raisi et al [26] develop a transformer-based architecture for recognizing text in images that uses a 2D positional encoder so that the spatial information of the features is preserved. Biten et al [27] propose a layout-aware transformer with a pre-training scheme based only on text and spatial cues, and show that pre-training on scanned documents helps handle multimodality in scene-text visual question answering. Based on ViT [4], Tan et al [28] propose a mixture of pure-transformer experts for processing different resolutions in scene text recognition.…”
Section: B. Vision Transformers for STR
Citation type: mentioning (confidence: 99%)
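As a rough illustration of what a 2D positional encoder contributes, the sketch below builds a fixed sinusoidal encoding in which half the channels encode the row index and half the column index of a feature map, so spatial structure survives flattening into a token sequence. This is a generic formulation, not the exact encoder of Raisi et al [26]; the function name and sizes are placeholders.

```python
import math
import torch

def sinusoidal_pe_2d(height: int, width: int, dim: int) -> torch.Tensor:
    """Generic 2D sinusoidal positional encoding of shape (height, width, dim).

    The first dim/2 channels encode the row (y) position, the last dim/2 the
    column (x) position.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    half = dim // 2
    freq = torch.exp(torch.arange(0, half, 2, dtype=torch.float32) * (-math.log(10000.0) / half))
    y = torch.arange(height, dtype=torch.float32)[:, None] * freq[None, :]  # (H, half/2)
    x = torch.arange(width, dtype=torch.float32)[:, None] * freq[None, :]   # (W, half/2)
    pe_y = torch.cat([torch.sin(y), torch.cos(y)], dim=-1)                  # (H, half)
    pe_x = torch.cat([torch.sin(x), torch.cos(x)], dim=-1)                  # (W, half)
    pe = torch.cat([
        pe_y[:, None, :].expand(height, width, half),   # row code broadcast over columns
        pe_x[None, :, :].expand(height, width, half),   # column code broadcast over rows
    ], dim=-1)
    return pe  # added to (or concatenated with) the flattened visual features

# Example: encode a 16x48 feature map with 256 channels.
pe = sinusoidal_pe_2d(16, 48, 256)
```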
“…More Applications. Besides VLP for standard VL tasks, VLP has also been applied to tackle (i) TextVQA (Singh et al, 2019) and TextCaps (Sidorov et al, 2020) tasks that require an AI system to comprehend scene text in order to perform VQA and captioning, such as TAP (Yang et al, 2021d) and LaTr (Biten et al, 2022); (ii) visual dialog (Das et al, 2017) that requires an AI system to chat about an input image, such as VisDial-BERT (Murahari et al, 2020) and VD-BERT (Wang et al, 2020b); (iii) fashion-domain tasks, such as Kaleido-BERT (Zhuge et al, 2021) and Fashion-VLP (Goenka et al, 2022); and (iv) vision-language navigation (VLN), such as PREVALENT (Hao et al, 2020) and VLN-BERT (Hong et al, 2021), to name a few. A detailed literature review on VLN can be found in Gu et al (2022b).…”
Section: VLP for Big Models
Citation type: mentioning (confidence: 99%)