Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/466
Deterministic Routing between Layout Abstractions for Multi-Scale Classification of Visually Rich Documents

Abstract: Classifying heterogeneous visually rich documents is a challenging task. The difficulty increases further if the maximum allowed inference turnaround time is constrained by a threshold. The increased overhead in inference cost, compared to the limited gain in classification capability, makes current multi-scale approaches infeasible in such scenarios. There are two major contributions of this work. First, we propose a spatial pyramid model to extract highly discriminative multi-scale feature descriptors…
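
The abstract describes building multi-scale feature descriptors from the page layout with a spatial pyramid. As a rough illustration only (not the authors' implementation; the pyramid levels, the small shared backbone, and the feature size below are assumptions), such a descriptor can be obtained by splitting the page into progressively finer regions and pooling a shared CNN's output per region:

# Minimal sketch (assumed architecture, not the paper's): spatial-pyramid
# multi-scale descriptors for a document image. Pyramid levels 1x1, 2x2,
# and 4x4 and the tiny backbone are illustrative choices.
import torch
import torch.nn as nn

class PyramidRegionFeatures(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Small convolutional backbone shared across all pyramid regions.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one feat_dim vector per region
        )

    def forward(self, page: torch.Tensor) -> torch.Tensor:
        # page: (B, 1, H, W) grayscale document image.
        descriptors = []
        for level in (1, 2, 4):  # whole page, quadrants, 4x4 grid
            b, _, h, w = page.shape
            rh, rw = h // level, w // level  # remainder pixels are dropped
            for i in range(level):
                for j in range(level):
                    region = page[:, :, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
                    descriptors.append(self.backbone(region).flatten(1))
        # Concatenate region descriptors into one multi-scale feature vector.
        return torch.cat(descriptors, dim=1)  # (B, feat_dim * 21)

# Usage: descriptors for a batch of two 256x256 document images.
feats = PyramidRegionFeatures()(torch.randn(2, 1, 256, 256))
print(feats.shape)  # torch.Size([2, 1344])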

Cited by 24 publications (13 citation statements). References 0 publications.
“…Moreover, with the same modal information, our LayoutLMv2 models also outperform existing multi-modal approaches PICK, TRIE, and the previous top-1 method on the leaderboard, confirming the effectiveness of our pre-training for text, layout, and visual information. The best performance on all the four datasets is achieved by the LayoutLMv2 models fine-tuned on the train set (accompanying table: (Das et al., 2018) 91.11%; Ensemble (Das et al., 2018) 92.21%; InceptionResNetV2 (Szegedy et al., 2017) 92.63%; LadderNet (Sarkhel and Nandi, 2019) 92.77%; Single model (Dauphinee et al., 2019) 93.03%; Ensemble (Dauphinee et al., 2019) 93.07%). By using all data (train + dev) as the fine-tuning dataset, the LayoutLMv2 LARGE single model outperforms the previous top-1 on the leaderboard, which ensembles 30 models.…”
Section: Entity Extraction Tasks (citation type: mentioning)
Confidence: 99%
“…The recent progress of VrDU lies primarily in two directions. The first direction is usually built on the shallow fusion between textual and visual/layout/style information (Yang et al., 2017; Liu et al., 2019; Sarkhel and Nandi, 2019; Majumder et al., 2020; Wei et al., 2020; …). These approaches leverage the pre-trained NLP and CV models individually and combine the information from multiple modalities for supervised learning.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%
“…A DCNN-based approach utilizing AlexNet, VGG16, GoogLeNet, and ResNet-50 was proposed in [46], where a classification accuracy of 91.13% was recorded. In [47], a spatial pyramid model is proposed to extract highly discriminative multi-scale features of document images by exploiting their inherent layouts. A deep multi-column CNN model is used to classify the images, with an overall classification accuracy of 82.78%.…”
Section: Discussion (citation type: mentioning)
Confidence: 99%
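
The statement above summarizes the classification side of the cited approach: a deep multi-column CNN consumes the page at several scales, and a single classifier reads out the fused column outputs. Below is a minimal sketch of that general idea, assuming illustrative column resolutions, column widths, and a 16-class head (none of these values are taken from the cited papers):

# Hedged sketch of a multi-column CNN classifier: each "column" sees the
# page at a different resolution; column outputs are concatenated and fed
# to a linear classifier. All sizes here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_column(out_dim: int) -> nn.Module:
    # One independent convolutional column.
    return nn.Sequential(
        nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, out_dim, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class MultiColumnClassifier(nn.Module):
    def __init__(self, num_classes: int = 16, col_dim: int = 32):
        super().__init__()
        self.scales = (64, 128, 256)  # per-column input resolutions
        self.columns = nn.ModuleList([make_column(col_dim) for _ in self.scales])
        self.head = nn.Linear(col_dim * len(self.scales), num_classes)

    def forward(self, page: torch.Tensor) -> torch.Tensor:
        # page: (B, 1, H, W); each column gets its own rescaled copy.
        outs = [
            col(F.interpolate(page, size=(s, s), mode="bilinear", align_corners=False))
            for col, s in zip(self.columns, self.scales)
        ]
        return self.head(torch.cat(outs, dim=1))  # class logits

# Usage: logits for a batch of two 256x256 document images.
logits = MultiColumnClassifier()(torch.randn(2, 1, 256, 256))
print(logits.shape)  # torch.Size([2, 16])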
“…Document-type classification: Classification of documents by type has frequently been treated as an image classification problem. Many works have used varying CNN architectures (Kang et al., 2014; Afzal et al., 2015; Harley et al., 2015; Afzal et al., 2017; Tensmeyer and Martinez, 2017; Das et al., 2018) or other vision-based techniques (…; Sarkhel and Nandi, 2019).…”
Section: Related Work (citation type: mentioning)
Confidence: 99%