An Integrated Approach of Deep Learning and Symbolic Analysis for Digital PDF Table Extraction

Zhang, Mengshi; Perelman, Daniel; Le, Vu; Gulwani, Sumit

doi:10.1109/icpr48806.2021.9413069

Cited by 5 publications

(6 citation statements)

References 21 publications

“…Our technique also applies to general DSLs in different domains rather than just XPaths for web extraction. Ideas around exploiting compositionality and data invariance have also been explored in previous works: [5,12] use commonly reoccurring phrasal patterns for web extraction given a seed set; in the vision community, modular approaches such as convolutional neural networks have been used for document image extraction [17,34,52,55,58], and notably algorithms based on R-CNN [19] use selective search to focus attention on a small number of regions from the image (region proposals). Our core ideas are similarly based around localised regions, but we detect them by identifying landmarks that present a common kind of invariance in formed documents.…”

Section: Robustness Of Experimental Results (mentioning)

confidence: 99%

“…While this approach shows improved robustness, it still generates global programs that can fail with irrelevant changes to the document format, and we show in this work how our compositional synthesis approach performs better empirically in practice. There has been very limited work in the area of synthesis for document image extraction, but notable works in specialized areas include [52], where concepts from inductive logic programming are combined with neural approaches, and [58], which combines symbolic reasoning with CNNs, though interpretable programs are not generated.…”

Section: Robustness Of Experimental Results (mentioning)

confidence: 99%

Landmarks and Regions: A Robust Approach to Data Extraction

Parthasarathy, Pattanaik, Khatry et al. 2022

Preprint

We propose a new approach to extracting data items or field values from semi-structured documents. Examples of such problems include extracting passenger name, departure time and departure airport from a travel itinerary, or extracting price of an item from a purchase receipt. Traditional approaches to data extraction use machine learning or program synthesis to process the whole document to extract the desired fields. Such approaches are not robust to format changes in the document, and the extraction process typically fails even if changes are made to parts of the document that are unrelated to the desired fields of interest. We propose a new approach to data extraction based on the concepts of landmarks and regions. Humans routinely use landmarks in manual processing of documents to zoom in and focus their attention on small regions of interest in the document. Inspired by this human intuition, we use the notion of landmarks in program synthesis to automatically synthesize extraction programs that first extract a small region of interest, and then automatically extract the desired value from the region in a subsequent step. We have implemented our landmark based extraction approach in a tool LRSyn, and show extensive evaluation on documents in HTML as

“…Deep learning techniques are now widely used to identify and extract tables in PDF documents [46], [152]. This aspect will be detailed later.…”

Section: B Data Extraction (mentioning)

confidence: 99%

An Overview of Data Extraction From Invoices

Saout, Lardeux, Saubion 2024

IEEE Access

This paper provides a comprehensive overview of the process for information retrieval from invoices. Invoices serve as proof of purchase and contain important information, including the date, description, quantity, and the price of goods or services, as well as the terms of payment. Companies must process invoices quickly and accurately to maintain proper financial records. To automate this workflow, commercial systems have been developed. Despite the complexity involved, realizing automated processing of invoices necessitates the harmonious integration of a wide range of techniques and methods. While several surveys have shed light on different aspects of this workflow, our objective in this paper is to present a synthetic view of the process and emphasize the most pertinent challenges. We discuss the digitalization of invoices and the use of natural language processing techniques to extract relevant information. We also review machine learning and deep learning techniques that are widely used to handle the variability of layouts, minimize end-user tasks, and train and adapt to new contexts. The purpose of this overview is not to evaluate various systems and algorithms, but rather to propose a survey that reviews a wide scope of techniques for different data extraction tasks, addressing both information extraction and structure recognition for invoice processing. Specifically, we focus on table processing, paying particular attention to graph-based approaches.

“…According to Hashmi et al (2021a), on “ICDAR‐2013 Table Competition” dataset (Göbel et al, 2013), F1‐score is close to 1.0 for TD when the threshold of “Intersection Over Union” (IOU) equals 0.5, and this score reaches 0.95 (IOU = 0.5) for TSR. However, the results may be not as good if one uses stricter metrics (Zhang et al, 2021a) or more complicated tables (Zhang et al, 2022). Particularly, this fact is also confirmed by developing the domain‐specific benchmarks (Adams et al, 2021; Desai et al, 2021).…”

Section: Problem Scope (mentioning)

confidence: 99%

Table understanding: Problem overview

Shigarov 2022

WIREs Data Min & Knowl

Tables are probably the most natural way to represent relational data in various media and formats. They store a large number of valuable facts that could be utilized for question answering, knowledge base population, natural language generation, and other applications. However, many tables are not accompanied by semantics for the automatic interpretation of the information they present. Table Understanding (TU) aims at recovering the missing semantics that enables the extraction of facts from tables. This problem covers a range of issues from table detection in document images to semantic table interpretation with the help of external knowledge bases. To date, the TU research has been ongoing for 30 years. Nevertheless, there is no common point of view on the scope of TU; the terminology still needs agreement and unification. In recent years, science and technology have shown a rapidly increasing interest in TU. Nowadays, it is especially important to check the meaning of this research problem once again. This article gives a comprehensive characterization of the TU problem, including a description of its subproblems, tasks, subtasks, and applications. It also discusses the common limitations used in the existing problem statements and proposes some directions for further research that would help overcome the corresponding limitations. This article is categorized under: Algorithmic Development > Text Mining; Algorithmic Development > Web Mining
