SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images

Tanaka, Ryota; Nishida, Keiya; Nishida, Kosuke; Hasegawa, Taku; Saito, Itsumi; Saito, Kuniko

doi:10.1609/aaai.v37i11.26598

Cited by 5 publications (1 citation statement)

References 32 publications (44 reference statements)

“…• Multi-page QA w/ Multi-hop & Discrete & Visual Reasoning requires understanding the content relationship via multi-hop reasoning as well as discrete/visual reasoning on multi-page documents (Tanaka et al. 2023; Landeghem et al. 2023).…”

Section: Dataset Collection (mentioning)

confidence: 99%

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Tanaka, Iki, Nishida et al. 2024. AAAI

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.

Document Understanding Dataset and Evaluation (DUDE)

Landeghem, Powalski, Tito et al. 2023. 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically-oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset (huggingface.co/datasets/jordyvl/DUDE_loader) with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origins, and dates. Moreover, we are pushing the boundaries of current methods by creating multi-task and multi-domain evaluation setups that more accurately simulate real-world situations where powerful generalization and adaptation under low-resource settings are desired. DUDE aims to set a new standard as a more practical, long-standing benchmark for the community, and we hope that it will lead to future extensions and contributions that address real-world challenges. Finally, our work illustrates the importance of finding more efficient ways to model language, images, and layout in DocAI.

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

Jin, Zhang et al. 2024. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
