Rubèn Tito scite author profile

This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that no Visual Question Answering system has addressed to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs, where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, which require reading the text in a scene and understanding it in the context of the scene in order to correctly answer a given question. A novel evaluation metric is presented, which assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of the results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone on the path towards more robust and general models that can exploit scene text to achieve holistic image understanding.

Real-time Lexicon-free Scene Text Retrieval

Mafla

Tito

Dey

et al. 2021

Pattern Recognition

ICDAR 2021 Competition on Document Visual Question Answering

Tito

Mathew

Jawahar

et al. 2021

Document Collection Visual Question Answering

Tito

Karatzas

Valveny

2021

Document Collection Visual Question Answering

Tito

Karatzas

Valveny

2021

Preprint

Current tasks and methods in Document Understanding aim to process documents as single elements. However, documents are usually organized in collections (historical records, purchase invoices) that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA), a new dataset and related task in which questions are posed over a whole collection of document images. The goal is not only to provide the answer to the given question, but also to retrieve the set of documents that contain the information needed to infer the answer. Along with the dataset, we propose a new evaluation metric and baselines that provide further insights into the new dataset and task.

Multimodal grid features and cell pointers for scene text visual question answering

Gómez

Biten

Tito

et al. 2021

Pattern Recognition Letters


Rubèn Tito

Scene Text Visual Question Answering

InfographicVQA

ICDAR 2019 Competition on Scene Text Visual Question Answering
