2021
DOI: 10.1007/978-3-030-86337-1_42

ICDAR 2021 Competition on Document Visual Question Answering

Cited by 15 publications (13 citation statements)
References 23 publications
“…BROS was proposed with an effective pretraining method (i.e., area masking) and a relative positional encoding trick. To validate the effectiveness of Webvicob-generated data, we pretrain BROS and measure performance on DocVQA Task 1 (Tito et al., 2021) and Task 3 (Mathew et al., 2022).…”
Section: Comparison Methods
Mentioning, confidence: 99%
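As a rough illustration of the area-masking idea mentioned in the statement above (masking spatially contiguous words on the page rather than sampling words independently), here is a minimal Python sketch. The function name area_mask, the mask_ratio parameter, and the normalised box format are illustrative assumptions, not BROS's actual implementation.

    import random

    def area_mask(words, boxes, mask_ratio=0.15, mask_token="[MASK]"):
        """Mask every word whose box centre falls inside one randomly
        sampled rectangular area, instead of masking words i.i.d.

        words: list of OCR tokens for one page
        boxes: list of (x0, y0, x1, y1) in normalised [0, 1] coordinates
        """
        # Sample a square whose expected coverage matches mask_ratio.
        side = mask_ratio ** 0.5
        ax = random.uniform(0, 1 - side)
        ay = random.uniform(0, 1 - side)

        masked = []
        for word, (x0, y0, x1, y1) in zip(words, boxes):
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            inside = ax <= cx <= ax + side and ay <= cy <= ay + side
            masked.append(mask_token if inside else word)
        return masked

Masking a whole region forces the model to rely on layout and surrounding context rather than on adjacent subwords, which is the intuition behind area masking as opposed to token-level masking.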
“…However, these approaches are unsuitable for real-world applications due to their high computational cost. For example, answering questions in DocVQA (Tito et al., 2021) requires an average of nearly 400 and a maximum of 4,000 OCR tokens (see Section 4.3 and Figure 7). By incorporating the learned queries mechanism and considering the application context, the integration of Cream and LLMs becomes more flexible, enabling the LLM to focus on specific aspects of visual input while generating accurate and contextually appropriate responses.…”
Section: Integration of Cream and LLMs
Mentioning, confidence: 99%
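To make the learned-queries mechanism concrete: a small, fixed set of trainable query vectors cross-attends over the (potentially ~4,000-token) OCR/visual feature sequence, so the language model receives a constant-size input regardless of page length. The sketch below is a generic Perceiver/Q-Former-style pooler written with PyTorch; the class name LearnedQueryPooler and all hyperparameters are illustrative assumptions, not Cream's actual module.

    import torch
    import torch.nn as nn

    class LearnedQueryPooler(nn.Module):
        """Compress a variable-length sequence of document features
        (e.g. thousands of OCR-token embeddings) into a fixed set of
        learned queries via cross-attention, so the downstream LLM
        only ever sees num_queries tokens."""

        def __init__(self, dim=768, num_queries=32, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, doc_feats):  # doc_feats: (batch, N, dim), N may be ~4,000
            batch = doc_feats.size(0)
            q = self.queries.unsqueeze(0).expand(batch, -1, -1)
            pooled, _ = self.attn(q, doc_feats, doc_feats)  # queries attend to features
            return self.norm(pooled)   # (batch, num_queries, dim)

    # Usage: 4,000 OCR-token features shrink to 32 tokens for the LLM.
    feats = torch.randn(2, 4000, 768)
    print(LearnedQueryPooler()(feats).shape)  # torch.Size([2, 32, 768])

This caps the LLM-side context cost at num_queries tokens per page, which is what makes the average-400/maximum-4,000 OCR-token range tractable.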
“…Our models are evaluated using the test sets of ChartQA (Masry et al., 2022), InfographicVQA (Mathew et al., 2022), and DocVQA (Tito et al., 2021), in order to gauge their effectiveness in accurately answering natural language queries reliant on a profound understanding and recognition of various image elements, such as text, objects, and relationships. A sample of the test datasets used is depicted in Figure 6.…”
Section: Test Datasets
Mentioning, confidence: 99%
“…Document Intelligence can be considered an umbrella term covering the problems of Key Information Extraction [10,54], Table Detection [41,38] and Structure Recognition [39,55], Document Layout Segmentation [5,4], Document Layout Generation [6,36,3,48], Document Visual Question Answering [51,50,32], and Document Image Enhancement [49,22,47], all of which involve understanding the visually rich semantic information and the structure of the different layout entities of a whole page.…”
Section: Related Work
Mentioning, confidence: 99%
“…The Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) [18] dataset used the IIT-CDIP metadata to create a new dataset for document classification. PubLayNet [56] and DocBank [26] are datasets designed for layout analysis tasks, while DocVQA [33,51], instead, is designed for the Visual Question Answering task over document images.…”
Section: Comparison to Existing Datasets
Mentioning, confidence: 99%