2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00439

Scene Text Visual Question Answering

Cited by 173 publications (81 citation statements) · References 33 publications

“…This line of research has been pursued by several studies, particularly thanks to the introduction of the Fact‐based VQA dataset (Wang et al., 2017b). The VQA task is now taking new directions, such as embodied approaches where an agent has to navigate an environment and answer questions about it (H. Chen et al., 2019; Das et al., 2018); video VQA, where the answer has to be found in videos rather than in static images (Lei et al., 2018, 2020); answering questions about diagrams and charts (Ebrahimi Kahou et al., 2017; Kafle et al., 2018); text VQA, which involves recognizing and interpreting textual content in images (Biten et al., 2019; Han et al., 2020); answering questions about medical images (see Abacha et al., 2020); and many others.…”
Section: The Recent Revival of VQA (mentioning)
confidence: 99%
“…Gurari et al. [36] showed that in a goal-oriented VQA setting where visually impaired individuals ask questions on images they take, answering a good number of questions requires the ability to read and interpret text on the images. This inspired the introduction of two datasets, Scene Text VQA [13] and TextVQA [12], where reading text on the images is pivotal to answering the questions asked on the images. Our work is different from these tasks on two accounts: (i) these datasets contain images "in the wild" which are drawn from popular scene text datasets or datasets like OpenImages [37], which predominantly have scattered text tokens compared to the handwritten document images we consider, and (ii) almost all VQA problems, including the ones involving text on the images, are formulated as QA on a single image, while the proposed QA task is for a collection of document images.…”
Section: Related Work (mentioning)
confidence: 99%
“…There are two parallel streams of work in Computer Vision (CV) and Natural Language Processing (NLP), toward measuring how well machines understand visual and textual data, respectively. The computer vision community has recently defined tasks like image captioning [9] and Visual Question Answering (VQA) [5, 10-13]. In VQA, objective performance is measured by looking at how accurately a model can answer a set of questions asked on images that humans can answer comfortably.…”
Section: Introduction (mentioning)
confidence: 99%
“…We appreciate the difficulty of the task, which requires not only reading the text correctly but also understanding the visual context in order to answer the question correctly. More details about the dataset can be found in [5].…”
Section: Competition Protocol (mentioning)
confidence: 99%
“…Leveraging scene text information in a VQA scenario implies a shift from existing models that cast VQA as a classification problem to generative approaches that are able to generate novel answers (in this case by recognizing and integrating scene text as necessary in the answer). For the proposed "Scene Text Visual Question Answering" (ST-VQA) challenge, we employ a new dataset, introduced by the organizers of the challenge [5]. The questions and answers in this dataset are defined in such a way that no question can be answered without reading/understanding the scene text present in the given image.…”
Section: Introduction (mentioning)
confidence: 99%