Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Baviskar, Dipali; Ahirrao, Swati; Kotecha, Ketan

doi:10.3390/data6070078

Cited by 6 publications

(4 citation statements)

References 21 publications


“…This system uses connected components and pixel analysis for classifying elements such as paragraphs, graphics, images, and tables in the document. In [11] the authors propose a dataset for unstructured invoice documents that covers a wide range of layouts, which is designed to generalize key field extraction tasks for unstructured documents. The dataset is evaluated using various feature extraction techniques as well as Artificial Intelligence methods.…”

Section: B Data Extraction (mentioning)

confidence: 99%

“…Let us particularly mention the works of Alfonseca et al [3] and R. Evans [40] that use the notion of "open domain". Recently, data sets have been made available for NER related to invoices [11]. From a practical point of view, Mikolov et al [90] demonstrate the benefit of using vector representation of words and also that it is possible to train a model of neural networks on a large training set, including a large number of sentences with approximately one billion words and a vocabulary of more than one million different words.…”

Section: Addressing Specific Information Extraction: Named Entity Rec... (mentioning)

confidence: 99%

“…Topic
[18] open source OCR solution
[89] handwritten character recognition
[28] handwritten OCR
[96] text recognition using deep learning
[62] deep learning based OCR
[92] OCR solution including image to speech transformation
[98] benchmark sets for OCR
[44] OpenCV system
[158] neural network based OCR
[58] description of Icdar2019 competition on scanned receipt
[10] A survey into OCR specialized for medical reports.
[77] A technique based on transformer architecture for OCR and a benchmark with modern solutions

Reference Topic
[133] seminal work on data extraction
[7] computational-geometry algorithms for analyzing document structures
[2] handling multiple types of data structures
[146] considering relations between data
[131] orientation of documents
[16] document layout analysis
[11] data sets for evaluation
[81] seminal work on pdf documents management
[48] data extraction from tables
[113] table extraction for pdf documents
[41] table detection for multipage pdf documents
[24] solving of the maximum independent set of rectangles problem
[149] pdf2table : method for extracting table
[46] graph neural network for extracting tables from pdf documents
[152] deep learning for pdf table extraction
[104] presentation of TAO for table detection and extraction

Reference Topic
[85] seminal work on NER
[135] NER Challenge at CoNLL
[35] ACE program : challenge for NER systems
[20] empirical study of NER
[3] procedure to automatically extend an ontology with domain specific knowledge
[40] system for NER in the open domain…”

Section: Reference (mentioning)

confidence: 99%


An Overview of Data Extraction From Invoices

Saout, Lardeux, Saubion

2024

IEEE Access


This paper provides a comprehensive overview of the process for information retrieval from invoices. Invoices serve as proof of purchase and contain important information, including the date, description, quantity, and the price of goods or services, as well as the terms of payment. Companies must process invoices quickly and accurately to maintain proper financial records. To automate this workflow, commercial systems have been developed. Despite the complexity involved, realizing automated processing of invoices necessitates the harmonious integration of a wide range of techniques and methods. While several surveys have shed light on different aspects of this workflow, our objective in this paper is to present a synthetic view of the process and emphasize the most pertinent challenges. We discuss the digitalization of invoices and the use of natural language processing techniques to extract relevant information. We also review machine learning and deep learning techniques that are widely used to handle the variability of layouts, minimize end-user tasks, and train and adapt to new contexts. The purpose of this overview is not to evaluate various systems and algorithms, but rather to propose a survey that reviews a wide scope of techniques for different data extraction tasks, addressing both information extraction and structure recognition for invoice processing. Specifically, we focus on table processing, paying particular attention to graph-based approaches.




“…The existence of a dataset in order to train the network for segmentation is another problem in the case of training networks for extracting relevant information from structured documents. Baviskar et al. [14] provide a well-annotated dataset which helps in training a network which can recognize varying invoices of the same format. However, in order to train a model which can recognize a wide range of invoices, the network should be trained across a large number of invoices as well.…”

Section: Related Work (mentioning)

confidence: 99%

Automated invoice data extraction using image processing

Manjunath, Nayak, Nishith et al.

2023

IJ-AI


Manually processing invoices that are in the form of scanned photocopies is a time-consuming process. There is a need to automate the extraction of data from invoices with a similar format. In this paper we investigate and analyse various image processing and text extraction techniques to improve the results of the optical character recognition (OCR) engine, which is applied to extract the text from the invoice. This paper also proposes the design and implementation of a web-enabled invoice processing system (IPS). The IPS consists of an annotation tool and an extraction tool. The annotation tool is used to mark the fields of interest in the invoice which are to be extracted. The extraction tool makes use of open source computer vision library (OpenCV) algorithms to detect text. The proposed system was tested on more than 25 types of invoices, with the average accuracy score lying between 85% and 95%. Finally, to provide ease of use, a web application is developed which also presents the results in a structured format. The entire system is designed to provide flexibility and automate the process of extracting details of interest from the invoices.
