In this paper, we describe a flexible form-reader system capable of extracting textual information from accounting documents, like invoices and bills of service companies. In this kind of document, the extraction of some information fields cannot take place without having detected the corresponding instruction fields, which are only constrained to range in given domains. We propose modeling the document's layout by means of attributed relational graphs, which turn out to be very effective for form registration, as well as for performing a focussed search for instruction fields. This search is carried out by means of a hybrid model, where proper algorithms, based on morphological operations and connected components, are integrated with connectionist models. Experimental results are given in order to assess the actual performance of the system.
In this paper we describe a method for classibing document images on the basis of their physical layout. The layout is fiescribed by means of a hierarchical description, the Modijied X-Y tree, that is derived by the classical X-Y tree segmentation aLgorithm taking into account cuts along lines in addition to cuts along white spaces between blocks. In order to reduce problems due to noise and skew of the input image, the Modijied X -Y tree is built on top of regions extracted by a commercial OCR. The tree is afterwards coded into a fixed-size representation that takes into account occurrences of tree-patterns in the tree representing the page. Lastly, this feature vector is fed to an artificial neural network that is trained to elassib document images.The system is applied to the classijication of documents belonging to Digital Libraries, examples of classes taken into account are "title page", "index", "regular page". Many tests have been curried out on a data-set of more than 600 pages belonging to an on-line Digital Library. These tests allowed us to conclude that the use of MXY trees is advantageous with respect to the classical XY decomposition for this classification task.
In this paper a system for analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest by simply presenting visual examples and applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume applications . Integration into production lines is under execution.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.