An Integrated Approach for Automatic Semantic Structure Extraction in Document Images

Berardi, Margherita; Lapi, Michele; Malerba, Donato

doi:10.1007/978-3-540-28640-0_17

Cited by 5 publications

(2 citation statements)

References 10 publications

(9 reference statements)


“…Nowadays pattern recognition techniques are often used [3] and they allow to perform indexing starting from a first phase of knowledge extraction. This techniques assume the availability of scanned documents with a resolution enough high to perform some OCR, and it's not our case (there are several handy written documents for example) at least for the treatment of the paper documents.…”

Section: Related Work (mentioning)

confidence: 99%

How can ontologies support enterprise digital and paper archives?

Barchetti, Guido, Pulimeno, et al. 2008

Proceedings of the 5th International Conference on Soft Computing as Transdisciplinary Science and Technology - CSTST '08


“…The extraction of such physical layout information is traditionally concerned with scanned images (e.g. OCR) [3] [4], but it is difficult to extract the layout information from electronic documents and engineering drawings. In this paper, we propose a document analysis method, which extracts text and layout information from various documents.…”

Section: Introduction (mentioning)

confidence: 99%

Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language

Hirano, Okano, Okada, et al. 2007

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)


We propose a document analysis method, which extracts text and layout information from document files of various formats. This method analyzes the Page Description Language (PDL) data generated from a printed document. By converting the document to PDL data, this method can handle various document formats. Graphic elements such as text objects, image objects, and path objects in the PDL data are analyzed to extract text and layout information (character size, character position, and table position). By applying OCR to the image objects and the path objects, text images in source documents and vectorized font characters in engineering drawings are converted to text. Moreover, tables in various documents are detected by analyzing path objects. Therefore, it is possible to extract the full content information from document files of various formats as long as the document is printable.
