2021
DOI: 10.3390/data6070078

Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Abstract: The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, the research in this area is encouraging the development of novel frameworks and tools that can automate the…

Cited by 6 publications (4 citation statements)
References 21 publications
“…This system uses connected components and pixel analysis for classifying elements such as paragraphs, graphics, images, and tables in the document. In [11] the authors propose a dataset for unstructured invoice documents that covers a wide range of layouts, which is designed to generalize key field extraction tasks for unstructured documents. The dataset is evaluated using various feature extraction techniques as well as Artificial Intelligence methods.…”
Section: B Data Extraction (mentioning)
confidence: 99%
“…Let us particularly mention the works of Alfonseca et al [3] and R. Evans [40] that use the notion of "open domain". Recently, data sets have been made available for NER related to invoices [11]. From a practical point of view, Mikolov et al [90] demonstrate the benefit of using vector representation of words and also that it is possible to train a model of neural networks on a large training set, including a large number of sentences with approximately one billion words and a vocabulary of more than one million different words.…”
Section: Addressing Specific Information Extraction: Named Entity Rec... (mentioning)
confidence: 99%
“…The existence of a dataset in order to train the network for segmentation is another problem in the case of training networks for extracting relevant information from structured documents. Baviskar et al [14] Provides a well annotated dataset which helps in training a network which can recognize varying invoices of the same format. However, in order to train a model which can recognize a wide range of invoices, the network should be trained across a large number of invoices as well.…”
Section: Related Work (mentioning)
confidence: 99%