Transforming paper documents into XML format with WISDOM++

Altamura, O.; Esposito, Floriana; Malerba, Donato

doi:10.1007/pl00013569

Cited by 61 publications

(33 citation statements)

References 21 publications


“…1 is a document analysis system that can transform textual black and white paper documents into XML format [2]. This is a complex process involving several steps.…”

Section: Wisdom++ (mentioning)

confidence: 99%

“…By sorting the dictionary with respect to MaxTF(i,t), words occurring frequently only in one document might be favored. By sorting each class dictionary according to the product MaxTF(i,t) * PF(i,t)^2, briefly denoted as the MaxTF-PF^2 (Max Term Frequency - Square Page Frequency) measure, the effect of this phenomenon is kept under control. Moreover, common words used in documents of a given class will appear in the first entries of the corresponding class dictionary.…”

Section: The Feature Extractor Module (mentioning)

confidence: 99%
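The ranking described in the statement above can be sketched in Python. This is a minimal illustration, not the cited system's implementation: the function name `rank_class_dictionary`, the input format (one token list per page of a single class), the use of relative frequency for term frequency, and the reading of PF as the fraction of pages containing the term are all assumptions, since the excerpt does not define MaxTF(i,t) or PF(i,t).

```python
from collections import Counter


def rank_class_dictionary(pages):
    """Rank the terms of one class by MaxTF * PF^2 (assumed definitions).

    pages: list of token lists, one per page of the class (hypothetical format).
    MaxTF is taken as the highest relative frequency of a term on any page,
    PF as the fraction of pages on which the term occurs.
    """
    max_tf = Counter()
    page_count = Counter()
    for tokens in pages:
        counts = Counter(tokens)
        for term, count in counts.items():
            relative = count / len(tokens)
            if relative > max_tf[term]:
                max_tf[term] = relative
            page_count[term] += 1
    n_pages = len(pages)
    scores = {t: max_tf[t] * (page_count[t] / n_pages) ** 2 for t in max_tf}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


pages = [
    ["results", "rare", "rare", "rare"],
    ["results", "method"],
    ["results", "method", "data", "data"],
]
ranked = rank_class_dictionary(pages)
```

In this toy input, "rare" has the highest MaxTF but occurs on only one of three pages, so the squared page frequency pushes it below "results" and "method", which is the effect the statement describes.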


An Integrated Approach for Automatic Semantic Structure Extraction in Document Images

Berardi

Lapi

Malerba

2004

Document Analysis Systems VI

Self Cite


Abstract. In this paper we present an integrated approach for semantic structure extraction in document images. Document images are initially processed to extract both their layout and logical structures on the base of geometrical and spatial information. Then, textual content of logical components is employed for automatic semantic labeling of layout structures. To support the whole process different machine learning techniques are applied. Experimental results on a set of biomedical multi-page documents are discussed and future directions are drawn.



“…The document classification components of the WISDOM++ system (Altamura et al, 2001) are based on first-order learning algorithms (Esposito et al, 2000). Another advantage of such systems is their flexibility compared to the non-learning based systems.…”

Section: Introduction (mentioning)

confidence: 99%

“…Markup languages are a good example of representation means with such qualities. The system presented in (Worring and Smeulders, 1999) uses HTML as its final output form, while Altamura et al (2001) use XML. More abstract representations are labeled and weighted graphs.…”

Section: Introduction (mentioning)

confidence: 99%

Thick 2D relations for document understanding

Aiello

Smeulders

2004

Information Sciences


We use a propositional language of qualitative rectangle relations to detect the reading order from document images. To this end, we define the notion of a document encoding rule and we analyze possible formalisms to express document encoding rules such as LaTeX and SGML. Document encoding rules expressed in the propositional language of rectangles are used to build a reading order detector for document images. In order to achieve robustness and avoid brittleness when applying the system to real life document images, the notion of a thick boundary interpretation for a qualitative relation is introduced. The framework is tested on a collection of heterogeneous document images showing recall rates up to 89%.
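One possible reading of a thick boundary interpretation is sketched below in Python. It is an illustration only, not the definition used in the paper: it assumes that a rectangle relation is a pair of Allen-style interval relations (one per axis) and that two endpoints closer than a tolerance count as coincident. The names `interval_relation`, `rectangle_relation`, and `thickness` are hypothetical.

```python
def interval_relation(a, b, thickness=0.0):
    """Allen-style relation between intervals a and b, each (start, end).

    Endpoints that differ by at most `thickness` are treated as equal,
    which is the assumed "thick boundary" reading.
    """
    def cmp(u, v):
        if abs(u - v) <= thickness:
            return 0
        return -1 if u < v else 1

    (a1, a2), (b1, b2) = a, b
    starts, ends = cmp(a1, b1), cmp(a2, b2)
    end_vs_start, start_vs_end = cmp(a2, b1), cmp(a1, b2)
    if end_vs_start < 0:
        return "precedes"
    if end_vs_start == 0:
        return "meets"
    if start_vs_end > 0:
        return "preceded_by"
    if start_vs_end == 0:
        return "met_by"
    if starts == 0 and ends == 0:
        return "equals"
    if starts == 0:
        return "starts" if ends < 0 else "started_by"
    if ends == 0:
        return "finishes" if starts > 0 else "finished_by"
    if starts < 0 and ends < 0:
        return "overlaps"
    if starts > 0 and ends > 0:
        return "overlapped_by"
    return "during" if starts > 0 else "contains"


def rectangle_relation(r, s, thickness=0.0):
    """Pair of interval relations (x axis, y axis) for rectangles (x1, y1, x2, y2)."""
    return (
        interval_relation((r[0], r[2]), (s[0], s[2]), thickness),
        interval_relation((r[1], r[3]), (s[1], s[3]), thickness),
    )


# Two stacked text blocks whose edges are misaligned by a couple of pixels.
block_a = (10, 10, 200, 50)
block_b = (12, 52, 198, 90)
thin = rectangle_relation(block_a, block_b)
thick = rectangle_relation(block_a, block_b, thickness=3)
```

With zero thickness the small misalignment yields ("contains", "precedes"); with a tolerance of 3 the same pair is read as ("equals", "meets"), which is the kind of robustness to scanning noise that the abstract motivates.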


“…In this paper we present the multi-page DIA system WISDOM++ (http://www.di.uniba.it/~malerba/wisdom++/), whose architecture is knowledge-based and supports all the processing steps required for semantic indexing and storing in XML format [1]. More precisely, the transformation process performed by WISDOM++ consists of the preprocessing of the raster image of a scanned paper document, the segmentation of the preprocessed raster image into basic layout components, the classification of basic layout components according to the type of content (e.g., text, graphics, etc.…”

Section: Introduction (mentioning)

confidence: 99%

XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents

Malerba

Ceci

Berardi

2003

Lecture Notes in Computer Science

Self Cite


Abstract. Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose a combination of both XML and knowledge technologies can be used. XML distinguishes between data, its structure and semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationship. Moreover, the combination with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, an effective transformation of paper documents into XML format is a complex process involving several steps. In this paper we propose the application of knowledge technologies to many document processing steps, namely rule-based systems for semantic indexing of documents and the extraction of the necessary knowledge by means of machine learning techniques. This approach has been implemented in the system Wisdom++, which is currently used in the European project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material) to provide film archivists with a tool for the automated annotation of historical documents in film archives.


Transforming paper documents into XML format with WISDOM++

Abstract. The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems.
