Attend, Copy, Parse: End-to-end Information Extraction from Documents

Palm, Rasmus; Laws, Florian; Winther, Ole

doi:10.1109/icdar.2019.00060

Cited by 48 publications

(30 citation statements)

References 33 publications

“…It implies that it is useful to train over more data [10], [2], [19], [11]. 4) BERT: Bidirectional Encoder Representations for Transformers (BERT) [92], [10], [2], [19], [86], is nowadays the latest word embedding approach, that is effectively used in numerous biomedical and other text mining tasks. BERT learns the text representation from both the directions to better understand the context and the relationship.…”

Section: Named Entity Recognition (NER), mentioning

confidence: 99%

Efficient Automated Processing of the Unstructured Documents Using Artificial Intelligence: A Systematic Literature Review and Future Directions

et al. 2021

Unstructured data impacts 95% of organizations and costs them millions of dollars annually. If managed well, it can significantly improve business productivity. Traditional information extraction techniques are limited in their functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing in the literature. The purpose of this Systematic Literature Review (SLR) is to recognize and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. The SLR guidelines proposed by Kitchenham and Charters were followed to conduct a literature search on various databases between 2010 and 2020. We found that: (1) the existing information extraction techniques are template-based or rule-based; (2) the existing methods lack the capability to tackle complex document layouts in real-time situations such as invoices and purchase orders; (3) the datasets available publicly are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR discovered that AI-based approaches have a strong potential to extract useful information from unstructured documents automatically. However, they face certain challenges in processing multiple layouts of unstructured documents. Our SLR brings out the conceptualization of a framework for constructing a high-quality unstructured-document dataset with strong data validation techniques for automated information extraction. Our SLR also reveals a need for a close association between businesses and researchers to handle the various challenges of unstructured data analysis.

“…• Dataset quality [25], [26] Missing data values and few other errors lead to an insignificant extraction of data.…”

Section: Poor Quality, mentioning

confidence: 99%

Multi-Layout Unstructured Invoice Documents Dataset: A Dataset for Template-Free Invoice Processing and Its Evaluation Using AI Approaches

2021

The daily transactions of an organization generate a vast amount of unstructured data such as invoices and purchase orders. Managing and analyzing unstructured data is a costly affair for the organization. Unstructured data has a wealth of hidden valuable information. Extracting such insights automatically from unstructured documents can significantly increase the productivity of an organization. Thus, there is a huge demand to develop a tool that can automate the extraction of key fields from unstructured documents. Researchers have used different approaches for extracting key fields, but the lack of annotated and high-quality datasets is the biggest challenge. Existing work in this area has used standard and custom datasets for extracting key fields from unstructured documents. Still, the existing datasets face some serious challenges, such as poor-quality images, domain-related datasets, and a lack of data validation approaches to evaluate data quality. This work highlights the detailed process flow for end-to-end key field extraction from unstructured documents. This work presents a high-quality, multi-layout unstructured invoice documents dataset assessed with a statistical data validation technique. The proposed multi-layout unstructured invoice documents dataset is highly diverse in invoice layouts to generalize key field extraction tasks for unstructured documents. The proposed multi-layout unstructured invoice documents dataset is evaluated with various feature extraction techniques such as GloVe, Word2Vec, and FastText, and AI approaches such as BiLSTM and BiLSTM-CRF. We also present a comparative analysis of feature extraction techniques and AI approaches on the proposed multi-layout unstructured invoice documents dataset. We attained the best results with the BiLSTM-CRF model.

Index terms: Artificial Intelligence (AI), information extraction, key field extraction, Named Entity Recognition (NER), template-free invoice processing, unstructured data.

“…Some convolutional layers are then applied to these models of document to obtain the token representations. In addition to better understanding the document layout, some authors [18,25] also include the pixel values of the document images in the input for capturing clues not conveyed by the text modality such as table ruling lines, logos and stamps.…”

Section: Related Work on Information Extraction (IE), mentioning

confidence: 99%

Data-Efficient Information Extraction from Documents with Pre-trained Language Models

Sage, Douzon, Aussem et al. 2021

Document Analysis and Recognition – ICDAR 2021 Workshops
