2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00439

Scene Text Visual Question Answering

Cited by 173 publications (81 citation statements) · References 33 publications

“…This line of research has been pursued by several studies, particularly thanks to the introduction of the Fact‐based VQA dataset (Wang et al., 2017b). The VQA task is now taking new directions, such as embodied approaches where an agent has to navigate an environment and answer questions about it (H. Chen et al., 2019; Das et al., 2018); video VQA, where the answer has to be found in videos rather than in static images (Lei et al., 2018, 2020); answering questions about diagrams and charts (Ebrahimi Kahou et al., 2017; Kafle et al., 2018); text VQA, which involves recognizing and interpreting textual content in images (Biten et al., 2019; Han et al., 2020); answering questions about medical images (see Abacha et al., 2020); and many others.…”
Section: The Recent Revival of VQA (mentioning)
confidence: 99%
“…Gurari et al. [36] showed that in a goal-oriented VQA setting where visually impaired individuals ask questions on images they take, answering a good number of questions requires the ability to read and interpret text on the images. This inspired the introduction of two datasets, Scene Text VQA [13] and TextVQA [12], where reading text on the images is pivotal to answering the questions asked on the images. Our work is different from these tasks on two accounts: (i) these datasets contain images "in the wild" which are drawn from popular scene text datasets or datasets like OpenImages [37], which predominantly have scattered text tokens compared to the handwritten document images we consider, and (ii) almost all VQA problems, including the ones involving text on the images, are formulated as QA on a single image, while the proposed QA task is for a collection of document images.…”
Section: Related Work (mentioning)
confidence: 99%
“…There are two parallel streams of work in Computer Vision (CV) and Natural Language Processing (NLP), toward measuring how well machines understand visual and textual data, respectively. The computer vision community has recently defined tasks like image captioning [9] and Visual Question Answering (VQA) [5, 10-13]. In VQA, objective performance is measured by looking at how accurately a model can answer a set of questions asked on images that humans can answer comfortably.…”
Section: Introduction (mentioning)
confidence: 99%
“…We appreciate the difficulty of the task, which requires not only reading the text correctly but also understanding the visual context in order to answer the question correctly. More details about the dataset can be found in [5].…”
Section: Competition Protocol (mentioning)
confidence: 99%
“…Leveraging scene text information in a VQA scenario implies a shift from existing models that cast VQA as a classification problem to generative approaches that are able to generate novel answers (in this case by recognizing and integrating scene text as necessary in the answer). For the proposed "Scene Text Visual Question Answering" (ST-VQA) challenge, we employ a new dataset, introduced by the organizers of the challenge [5]. The questions and answers in this dataset are defined in such a way that no question can be answered without reading/understanding the scene text present in the given image.…”
Section: Introduction (mentioning)
confidence: 99%