Automatic indexing of scanned documents: a layout-based approach

Esser, Daniel E.; Schuster, Daniel; Muthmann, Klemens; Berger, Michael; Schill, Alexander

doi:10.1117/12.908542

Cited by 28 publications

(16 citation statements)

References 9 publications


“…In literature there are several works that addressed the problem of document structuring for user application dealing with semantic search engine. In [9] the authors propose an approach to handle automatic indexing of documents based on generic positional extraction of index terms. For this purpose is applied the knowledge of document templates stored in a common full text search index to find index positions that were successfully extracted in the past.…”

Section: Related Work (mentioning)

confidence: 99%

A Semantic Search Engine in the Cloud

Amato

Gargiulo

Mazzeo

et al. 2013

2013 Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing


Due to the ease of data production within the Internet era, knowledge workers are increasingly overwhelmed by information from multiple information sources and yet still find it hard to navigate and search for the specific information required for the task at hand. This implies that knowledge worker productivity is reduced and that organizations may be making decisions on the basis of incomplete knowledge. Most search engines in use today rely strongly on keyword matching and on the user's ability to express the query. This leads to the retrieval of a large amount of irrelevant information, with a direct impact on the user, who spends a lot of time browsing the results and/or constructing more complex queries to refine the search output. To overcome this limitation, semantic-based solutions are increasingly adopted. In this work we propose a general architecture that implements a semantic search engine in the cloud that exploits semantic technologies to retrieve and present the right information to the user. Our search engine is aimed at providing support in the task of document composition, suggesting to the user the adequate section that could be inserted within a document.


“…Cesarini et al [6] learns a database of keywords for each template and fall back to a global database of keywords. Esser et al [7] uses a database of absolute positions of fields for each template. Medvet et al [8] uses a database of manually created (field, pattern, parser) triplets for each template, designs a probabilistic model for finding the most similar pattern in a template, and extracts the value with the associated parser.…”

Section: Related Work (mentioning)

confidence: 99%

“…A number of systems have been proposed that rely on first classifying the template, e.g. Intellix [3], ITESOFT [4], smartFIX [5] and others [6], [7], [8]. As these systems rely on having seen the template beforehand, they cannot accurately handle documents from unseen templates.…”

Section: Introduction (mentioning)

confidence: 99%

CloudScan - A Configuration-Free Invoice Analysis System Using Recurrent Neural Networks

Palm

Winther

Laws

2017

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)


We present CloudScan, an invoice analysis system that requires zero configuration or upfront annotation. In contrast to previous work, CloudScan does not rely on templates of invoice layout; instead, it learns a single global model of invoices that naturally generalizes to unseen invoice layouts. The model is trained using data automatically extracted from end-user provided feedback. This automatic training data extraction removes the requirement for users to annotate the data precisely. We describe a recurrent neural network model that can capture long range context and compare it to a baseline logistic regression model corresponding to the current CloudScan production system. We train and evaluate the system on 8 important fields using a dataset of 326,471 invoices. The recurrent neural network and baseline model achieve 0.891 and 0.887 average F1 scores respectively on seen invoice layouts. For the harder task of unseen invoice layouts, the recurrent neural network model outperforms the baseline with 0.840 average F1 compared to 0.788.


“…Many documents used in enterprises and governments are typically derived from templates, especially forms completed by users, e.g., tax forms, medical forms, job application forms, etc. Given a set of templates and a scanned paper document, an open problem is to quickly and accurately identify which template this scanned document was originally derived from (Esser et al, 2011). To solve this problem, a number of systems based on labeled information have been proposed and developed (Cunningham et al, 2002;T.…”

Section: Introduction (mentioning)

confidence: 99%

“…Studies have been performed to use image features to match a scanned document to its template (Hu et al, 2000). Some of these studies still require labeled information (Esser et al, 2011), while others require consistent high-quality data in order to function properly.…”

Section: Introduction (mentioning)

confidence: 99%

Robust Template Identification of Scanned Documents

Feng

Youssef

Sudarsan

2012

Proceedings of the International Conference on Knowledge Discovery and Information Retrieval


Identification of low-quality scanned documents is not trivial in real-world settings. Existing research, which mainly focuses on similarity-based approaches, relies on perfect string data from a document. Also, studies using image processing techniques for document identification rely on clean data and large differences among templates. Both approaches fail to maintain accuracy in the context of noisy data or when document templates are too similar to each other. In this paper, a probabilistic approach is proposed to identify the document template of scanned documents. The proposed algorithm works on imperfect OCR output and document collections containing very similar templates. Through experiment and analysis, this novel probabilistic approach is shown to achieve high accuracy on different data sets.
