Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/466
Deterministic Routing between Layout Abstractions for Multi-Scale Classification of Visually Rich Documents

Abstract: Classifying heterogeneous visually rich documents is a challenging task. The difficulty increases further if the maximum allowed inference turnaround time is constrained by a threshold. The increased overhead in inference cost, compared to the limited gain in classification capability, makes current multi-scale approaches infeasible in such scenarios. There are two major contributions of this work. First, we propose a spatial pyramid model to extract highly discriminative multi-scale feature descriptors…
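
The abstract describes building multi-scale feature descriptors from the page layout with a spatial pyramid. As a rough illustration only (not the authors' implementation; the pyramid levels, the small shared backbone, and the feature size below are assumptions), such a descriptor can be obtained by splitting the page into progressively finer regions and pooling a shared CNN's output per region:

# Minimal sketch (assumed architecture, not the paper's): spatial-pyramid
# multi-scale descriptors for a document image. Pyramid levels 1x1, 2x2,
# and 4x4 and the tiny backbone are illustrative choices.
import torch
import torch.nn as nn

class PyramidRegionFeatures(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Small convolutional backbone shared across all pyramid regions.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one feat_dim vector per region
        )

    def forward(self, page: torch.Tensor) -> torch.Tensor:
        # page: (B, 1, H, W) grayscale document image.
        descriptors = []
        for level in (1, 2, 4):  # whole page, quadrants, 4x4 grid
            b, _, h, w = page.shape
            rh, rw = h // level, w // level  # remainder pixels are dropped
            for i in range(level):
                for j in range(level):
                    region = page[:, :, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
                    descriptors.append(self.backbone(region).flatten(1))
        # Concatenate region descriptors into one multi-scale feature vector.
        return torch.cat(descriptors, dim=1)  # (B, feat_dim * 21)

# Usage: descriptors for a batch of two 256x256 document images.
feats = PyramidRegionFeatures()(torch.randn(2, 1, 256, 256))
print(feats.shape)  # torch.Size([2, 1344])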

Cited by 24 publications (13 citation statements). References 0 publications.
“…Moreover, with the same modal information, our LayoutLMv2 models also outperform existing multi-modal approaches PICK, TRIE, and the previous top-1 method on the leaderboard, confirming the effectiveness of our pre-training for text, layout, and visual information. The best performance on all the four datasets is achieved by the LayoutLMv2 models fine-tuned on the train set (accompanying table: (Das et al., 2018) 91.11%; Ensemble (Das et al., 2018) 92.21%; InceptionResNetV2 (Szegedy et al., 2017) 92.63%; LadderNet (Sarkhel and Nandi, 2019) 92.77%; Single model (Dauphinee et al., 2019) 93.03%; Ensemble (Dauphinee et al., 2019) 93.07%). By using all data (train + dev) as the fine-tuning dataset, the LayoutLMv2 LARGE single model outperforms the previous top-1 on the leaderboard, which ensembles 30 models.…”
Section: Entity Extraction Tasks (citation type: mentioning)
Confidence: 99%
“…The recent progress of VrDU lies primarily in two directions. The first direction is usually built on the shallow fusion between textual and visual/layout/style information (Yang et al., 2017; Liu et al., 2019; Sarkhel and Nandi, 2019; Majumder et al., 2020; Wei et al., 2020; …). These approaches leverage the pre-trained NLP and CV models individually and combine the information from multiple modalities for supervised learning.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%
“…A DCNN-based approach utilizing AlexNet, VGG16, GoogLeNet, and ResNet-50 was proposed in [46], where a classification accuracy of 91.13% was recorded. In [47], a spatial pyramid model is proposed to extract highly discriminative multi-scale features of document images by exploiting their inherent layouts. A deep multi-column CNN model is used to classify the images, with an overall classification accuracy of 82.78%.…”
Section: Discussion (citation type: mentioning)
Confidence: 99%
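
The statement above summarizes the classification side of the cited approach: a deep multi-column CNN consumes the page at several scales, and a single classifier reads out the fused column outputs. Below is a minimal sketch of that general idea, assuming illustrative column resolutions, column widths, and a 16-class head (none of these values are taken from the cited papers):

# Hedged sketch of a multi-column CNN classifier: each "column" sees the
# page at a different resolution; column outputs are concatenated and fed
# to a linear classifier. All sizes here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_column(out_dim: int) -> nn.Module:
    # One independent convolutional column.
    return nn.Sequential(
        nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, out_dim, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class MultiColumnClassifier(nn.Module):
    def __init__(self, num_classes: int = 16, col_dim: int = 32):
        super().__init__()
        self.scales = (64, 128, 256)  # per-column input resolutions
        self.columns = nn.ModuleList([make_column(col_dim) for _ in self.scales])
        self.head = nn.Linear(col_dim * len(self.scales), num_classes)

    def forward(self, page: torch.Tensor) -> torch.Tensor:
        # page: (B, 1, H, W); each column gets its own rescaled copy.
        outs = [
            col(F.interpolate(page, size=(s, s), mode="bilinear", align_corners=False))
            for col, s in zip(self.columns, self.scales)
        ]
        return self.head(torch.cat(outs, dim=1))  # class logits

# Usage: logits for a batch of two 256x256 document images.
logits = MultiColumnClassifier()(torch.randn(2, 1, 256, 256))
print(logits.shape)  # torch.Size([2, 16])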
“…Document-type classification: Classification of documents by type has frequently been treated as an image classification problem. Many works have used varying CNN architectures (Kang et al., 2014; Afzal et al., 2015; Harley et al., 2015; Afzal et al., 2017; Tensmeyer and Martinez, 2017; Das et al., 2018) or other vision-based techniques (…; Sarkhel and Nandi, 2019).…”
Section: Related Work (citation type: mentioning)
Confidence: 99%