OCR-VQA: Visual Question Answering by Reading Text in Images

Huang

Proceedings of the 28th International Conference on Computational Linguistics

2020

Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.

Section: Introduction (mentioning)

confidence: 84%

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Huang

Proceedings of the 28th International Conference on Computational Linguistics

2020

Proceedings of the 28th ACM International Conference on Multimedia

“…How to leverage information from text tokens, how to understand relationships between text tokens and visual objects or between different tokens, how to predict a text token with language models are still problems that need to be explored. [24] propose to extract text blocks before conducting optical character recognition. The block features are then combined with image features and question features to predict the final answer.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

Wang

Tang

Luo

2020

OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties originate from how to make full use of the knowledge contained in the textual entities to facilitate sentence generation and how to predict a text token based on the limited information provided by the image. Such problems are not yet fully investigated in existing research. In this paper, we present a novel design, the Multimodal Attention Captioner with OCR Spatial Relationship (MMA-SR) architecture, which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner. Different attention scores for text tokens and objects are exploited through the multimodal attention network. Based on the attended features and the LSTM states, words are selected from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments conducted on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. More remarkably, our MMA-SR increases CIDEr-D score from 93.7% to 98.0%.

Pattern Recognition and Artificial Intelligence

“…11%) [4] can be attributed to the straightforward architecture used for their authors, more analysis are required to determine the convenience of using this n-gram representation for the answer space. As this task is attracting attention, recent works present the task by introducing new databases, [10] introduces a new database, OCR-VQA-200K comprising images of bookcovers, [12] introduces a database containing images of business brands, movie posters and book covers.…”

Section: Related Work (mentioning)

confidence: 99%

An Extended Evaluation of the Impact of Different Modules in ST-VQA Systems

Beltrán

Coustaty

Journet

et al.

2020

Scene Text VQA has been recently proposed as a new challenging task in the context of multimodal content description. The aim is to teach traditional VQA models to read text contained in natural images by performing a semantic analysis between the visual content and the textual information contained in associated questions to give the correct answer. In this work, we present results obtained after evaluating the relevance of different modules in the proposed frameworks using several experimental setups and baselines, and we expose some of the main drawbacks and difficulties when facing this problem. We make use of a strong VQA architecture and explore key model components such as suitable embeddings for each modality, the relevance of the dimension of the answer space, the calculation of scores and appropriate selection of the number of spaces in the copy module, and the gain obtained when additional data is sent to the system. We emphasize and present alternative solutions to the out-of-vocabulary (OOV) problem, which is one of the critical points when solving this task. For the experimental phase, we make use of the TextVQA database, which is one of the main databases targeting this problem.