2020
DOI: 10.48550/arxiv.2010.02358
Preprint
VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

Abstract: We introduce a novel approach to scanned-document representation for field extraction. It simultaneously encodes the textual, visual, and layout information in a 3D matrix used as input to a segmentation model. We improve on the recent Chargrid and Wordgrid [1] models in several ways: first by taking the visual modality into account, then by boosting robustness on small datasets while keeping inference time low. Our approach is tested on public and private document-image …

Cited by 4 publications (5 citation statements)
References 11 publications
“…Chargrid (Katti et al., 2018) uses a convolution-based encoder-decoder network to fuse text information into images by performing one-hot encoding on characters. VisualWordGrid (Kerroumi et al., 2020) implements Wordgrid (Katti et al., 2018) by replacing character-level text information with word-level word2vec features, and fuses in visual information to improve extraction performance. BERTgrid (Denk & Reisswig, 2019) uses BERT to obtain contextual text representations, which further improves end-to-end accuracy.…”
Section: Visual Information Extraction (mentioning)
confidence: 99%
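The grid encodings contrasted in this excerpt can be sketched concretely. Below is a minimal, hypothetical Chargrid-style encoder: each OCR'd character's bounding box is filled with that character's one-hot vector in an H × W × V tensor. The toy vocabulary, the `(char, x0, y0, x1, y1)` tuple format, and the grid size are illustrative assumptions, not the papers' actual implementations.

```python
import numpy as np

# Toy character vocabulary; real systems use a larger, task-specific one.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR2IDX = {c: i for i, c in enumerate(VOCAB)}

def chargrid(ocr_chars, height, width):
    """Build an (H, W, V) one-hot character grid.

    ocr_chars: list of (char, x0, y0, x1, y1) boxes in pixel coordinates,
    as they might come from an OCR engine (format assumed for illustration).
    """
    grid = np.zeros((height, width, len(VOCAB)), dtype=np.float32)
    for ch, x0, y0, x1, y1 in ocr_chars:
        idx = CHAR2IDX.get(ch.lower())
        if idx is None:
            continue  # character outside the toy vocabulary
        # Fill the character's bounding box with its one-hot channel.
        grid[y0:y1, x0:x1, idx] = 1.0
    return grid

g = chargrid([("A", 2, 3, 5, 7)], height=10, width=10)
```

A Wordgrid/VisualWordGrid variant would replace the one-hot axis with a word-embedding dimension (e.g. word2vec vectors) and fill whole word boxes rather than individual character boxes.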
“…The research study [19] used the RVL-CDIP dataset, which includes scanned document images of different categories, with invoices as one of the categories. It contains 25,000 images per category.…”
Section: Related Datasets (mentioning)
confidence: 99%
“…Grid-based methods [1], [7], [9] exploit the textual and spatial information of a document by building a grid in which pixels are encoded using character- or token-level embeddings. This grid is then fed to a convolutional encoder-decoder network.…”
Section: Introduction (mentioning)
confidence: 99%
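The multimodal fusion VisualWordGrid adds on top of such a grid can be sketched as a channel-wise concatenation of the page image with the text-embedding grid, yielding the single 3D input tensor mentioned in the abstract. The shapes and the embedding dimension used here are illustrative assumptions.

```python
import numpy as np

def fuse(image_rgb, word_grid):
    """Concatenate visual and textual channels into one 3D input tensor.

    image_rgb: (H, W, 3) scanned page, values in [0, 1].
    word_grid: (H, W, D) word-embedding grid aligned to the same pixels.
    Returns an (H, W, 3 + D) tensor for a segmentation network.
    """
    assert image_rgb.shape[:2] == word_grid.shape[:2]
    return np.concatenate([image_rgb, word_grid], axis=-1)

# Example with an assumed embedding dimension D = 8.
x = fuse(np.zeros((64, 64, 3)), np.zeros((64, 64, 8)))
# x.shape == (64, 64, 11)
```

A segmentation model then consumes the fused tensor and predicts a field label per pixel, which is how grid-based methods cast field extraction as semantic segmentation.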
“…[1]), or by passing the image through a separate encoder (e.g. [9]). One-hot character encoding was used in [1], while in [9] static word embeddings were utilized.…”
Section: Introduction (mentioning)
confidence: 99%