Semantics-based content extraction in typewritten historical documents

Antonacopoulos, Apostolos; Karatzas, Dimosthenis

doi:10.1109/icdar.2005.215

Cited by 27 publications

(32 citation statements)

References 5 publications

“…The determination of word breaks is made in a manner that adapts to the writing style of the individual. For the case of historical machine-printed documents, Antonacopoulos and Karatzas [7] calculate and analyze the horizontal projection profile to identify suitable spaces between words. In the work of Gatos et al. [8], word segmentation in historical machine-printed documents is based on a run length smoothing in the horizontal and vertical directions.…”

Section: Introduction (mentioning)

confidence: 99%

An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents

Makridis

Νικολάου

Gatos

2007

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
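The projection-profile idea in the citation statement above can be illustrated with a minimal sketch. It is not the method of Antonacopoulos and Karatzas as published: it assumes a binarized text line stored as a list of rows (1 for ink, 0 for background), takes the profile to be the ink count per column, and treats any empty run of at least `min_gap` columns between two inked columns as a word break. The names `column_profile`, `find_gaps`, and `min_gap` are hypothetical.

```python
def column_profile(line):
    """Count ink pixels (value 1) in each column of a binary text-line image."""
    return [sum(col) for col in zip(*line)]


def find_gaps(profile, min_gap):
    """Return (start, end) column ranges, end exclusive, of empty runs
    at least min_gap wide that lie between two inked columns.

    Empty runs touching the left or right edge are treated as margins
    and ignored (an assumption of this sketch).
    """
    gaps = []
    start = None
    seen_ink = False
    for x, count in enumerate(profile):
        if count == 0:
            if seen_ink and start is None:
                start = x  # an empty run begins after some ink
        else:
            if start is not None and x - start >= min_gap:
                gaps.append((start, x))
            start = None
            seen_ink = True
    return gaps


# Toy text line: two rows, eleven columns.
line = [
    [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
]
profile = column_profile(line)
word_gaps = find_gaps(profile, min_gap=3)
```

With `min_gap=3` only the wide empty run is reported as a word break; lowering the threshold to 1 also reports the narrow run, which in practice would more likely be an inter-character space.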

“…The authors, with other colleagues in their laboratory, have implemented and experimented with various such thresholding techniques [5]. That work demonstrated that the indiscriminate application of any thresholding approach (global or local) does not yield as good results as when a method is applied only to the segmented text.…”

Section: Introduction (mentioning)

confidence: 99%

Flexible Text Recovery from Degraded Typewritten Historical Documents

Antonacopoulos

Castilla

2006

18th International Conference on Pattern Recognition (ICPR'06)

Self Cite

“…Features based upon a convex hull are insensitive to character fonts and sizes, the touching-character problem of various fonts and sizes can be handled even for heavily touching characters or italic-type overlapping characters without slant correction. Table 1 summarizes the characteristics of those approaches [1,8,10,14,4] mentioned above.…”

Section: Introduction (mentioning)

confidence: 99%

“…The best known of these segmentation algorithms are the following: projection analysis, connected component analysis, Run Length Smoothing Algorithm (RLSA), contour shape analysis and Hough transform. Representative examples of character segmentation methodologies are the following: Antonacopoulos and Karatzas [1] use the horizontal projection profile of each word segment for character segmentation in historical machine-printed documents. This approach cannot handle the case of overlapping characters.…”

Section: Introduction (mentioning)

confidence: 99%

“…The parameters are either user-specified and no training method is included [1,8,10,14,4] or selected through a training procedure over a set of "optimal" parameter values that are usually manually selected based on some assumption regarding the training data [11], [6]. In general, automatic selection of the free parameters is actually an optimization problem [2].…”

Section: Introduction (mentioning)

confidence: 99%

Automatic unsupervised parameter selection for character segmentation

Vamvakas

Stamatopoulos

Gatos

et al. 2010

Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

A major difficulty in designing a document image segmentation methodology is selecting proper values for all the parameters involved. This is usually done through experimentation or through a supervised training phase, which is tedious because the corresponding segmentation ground truth has to be created. In this paper, we propose a novel automatic unsupervised parameter selection methodology that can be applied to the character segmentation problem. It is based on clustering the entities obtained from the segmentation for different values of the parameters involved in the segmentation method. The clustering uses features extracted from the segmented entities based on zones and from the area formed by the projections of the upper/lower and left/right profiles. Optimization of an appropriate intra-class distance measure yields the optimal parameter vector. The method is evaluated on two segmentation algorithms: a recently proposed character segmentation technique based on skeleton segmentation paths, and the well-known RLSA technique. The proposed parameter selection method is capable of finding the segmentation parameters that correspond to the optimal or near-optimal segmentation result, as determined by counting the number of matches between the entities detected by the segmentation algorithm and the entities in the ground truth.
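For readers unfamiliar with the RLSA technique named in this abstract, the following is a minimal sketch of run length smoothing, not the configuration evaluated in the paper. It assumes a binary image stored as a list of rows (1 for ink, 0 for background), fills only background runs that are bounded by ink on both sides and no longer than a threshold, and combines the horizontal and vertical passes with a logical AND. The function names and thresholds are hypothetical; these thresholds are the kind of free parameters that a selection method like the one described would have to choose.

```python
def smooth_row(row, threshold):
    """Fill background runs of length <= threshold that lie between two ink pixels."""
    out = list(row)
    last_ink = None
    for x, value in enumerate(row):
        if value == 1:
            if last_ink is not None and 0 < x - last_ink - 1 <= threshold:
                for k in range(last_ink + 1, x):
                    out[k] = 1
            last_ink = x
    return out


def rlsa_horizontal(image, threshold):
    """Apply run length smoothing to every row."""
    return [smooth_row(row, threshold) for row in image]


def rlsa_vertical(image, threshold):
    """Apply run length smoothing to every column by transposing twice."""
    columns = [smooth_row(list(col), threshold) for col in zip(*image)]
    return [list(row) for row in zip(*columns)]


def rlsa(image, h_threshold, v_threshold):
    """Combine the horizontal and vertical smoothing results with a logical AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]


smoothed = smooth_row([1, 0, 0, 1, 0, 0, 0, 1, 0], threshold=2)
```

In the usage line, the two-pixel run is filled while the three-pixel run and the trailing background pixel are left untouched, so nearby characters merge into one blob and wider spaces survive as separators.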

Semantics-based content extraction in typewritten historical documents

Abstract: This paper presents a flexible approach to extracting content from scanned historical documents using semantic information.
