Document image content inventories

“…Our classifier, discussed in [4,6,7], is an approximation of k-Nearest Neighbors and is used to classify each pixel in a document image by assigning it a class label, such as machine print, handwriting, photograph, etc. Features are extracted from each training sample (pixel) from a small, local window of no more than 20 pixels wide.…”

Section: Our Ground Truth Policy (mentioning)

confidence: 99%
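
The per-pixel classification described in the excerpt above can be illustrated with a minimal sketch. It makes several assumptions not taken from the source: the features are simply the raw grey values in a small square window (the excerpt does not say which features are used), the classifier is exact brute-force k-NN rather than the approximation the authors mention, and the names `window_features` and `knn_classify` are hypothetical.

```python
from collections import Counter


def window_features(image, row, col, half_width=2):
    """Flatten the grey values in a square window centred on (row, col).

    The window is (2 * half_width + 1) pixels wide; positions that fall
    outside the image are padded with 0. Raw grey values are an assumed,
    illustrative feature set.
    """
    height, width = len(image), len(image[0])
    feats = []
    for r in range(row - half_width, row + half_width + 1):
        for c in range(col - half_width, col + half_width + 1):
            inside = 0 <= r < height and 0 <= c < width
            feats.append(image[r][c] if inside else 0)
    return feats


def knn_classify(sample, training, k=3):
    """Exact brute-force k-NN: majority label among the k nearest samples.

    `training` is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(training, key=lambda item: sq_dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

To label a whole page under these assumptions, one would call `window_features` for every pixel and pass the result to `knn_classify`; a practical system would replace the linear scan with an approximate nearest-neighbour search.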

“…In previous work [4,5,9,10,1], we have described a research program investigating versatile algorithms for document image content extraction, that is, locating regions containing machine printed text, handwriting, photographs, etc. This program seeks to solve this problem in full generality, handling a vast variety of document and image types.…”

Section: Introduction (mentioning)

confidence: 99%

Truthing for Pixel-Accurate Segmentation

Moll

Baird

Chang

2008

2008 the Eighth IAPR International Workshop on Document Analysis Systems

Self Cite

“…The bottom-up strategy, on the other hand, performs classification on small, naturally given parts of a document e.g. pixels, connected components, or individual strokes in online documents [2,8,18]. A clustering algorithm may follow to group small entities into larger, meaningful segments.…”

Section: Introduction (mentioning)

confidence: 99%
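
As a rough illustration of the bottom-up strategy in the excerpt above, the sketch below extracts connected components from a binary image so that each one could then be classified or clustered. It is an assumption-laden example: the function name `connected_components`, the default 8-connectivity, and the list-of-lists image format are illustrative choices, not details from the cited papers.

```python
from collections import deque


def connected_components(binary, connectivity=8):
    """Return a list of components, each a list of (row, col) foreground pixels.

    `binary` is a list of rows where truthy values are foreground (ink).
    Uses breadth-first flood fill with 4- or 8-connectivity.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    if connectivity == 8:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or seen[r][c]:
                continue
            # Start a new component and grow it from this seed pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                cr, cc = queue.popleft()
                pixels.append((cr, cc))
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < height and 0 <= nc < width
                            and binary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components
```

Each returned component is one of the "small entities" that a bottom-up method would label individually before any grouping step.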

“…Top-down methods are prevailing if the document structure can be analyzed rather easily [9] (as in scientific papers or newspapers, for example). Pixel classification is preferred where the structure is difficult to recognize [2] (as in magazines where text and images may be mixed rather irregularly). In the field of online handwritten document analysis, the distinction of text and non-text is accomplished with a bottom-up approach in [8,13] where single strokes, as the smallest entities, are classified.…”

Section: Introduction (mentioning)

confidence: 99%

Text versus non-text distinction in online handwritten documents

Indermühle

Bunke

Breuel

2010

Proceedings of the 2010 ACM Symposium on Applied Computing

The aim of this paper is to explore how well the task of text vs. non-text distinction can be solved in online handwritten documents using only offline information. Two systems are introduced. The first system generates a document segmentation first. For this purpose, four methods originally developed for machine printed documents are compared: x-y cut, morphological closing, Voronoi segmentation, and whitespace analysis. A state-of-the-art classifier then distinguishes between text and non-text zones. The second system follows a bottom-up approach that classifies connected components. Experiments are performed on a new dataset of online handwritten documents containing different content types in arbitrary arrangements. The best system assigns 94.3% of the pixels to the correct class.
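
A figure such as "94.3% of the pixels" is a per-pixel accuracy. The sketch below shows one plausible way to compute it from a predicted label map and a ground-truth label map of the same shape. The function name `pixel_accuracy` is hypothetical, and the abstract does not state whether background pixels are counted, so this version simply counts every pixel it is given.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted label equals the ground-truth label.

    Both arguments are lists of rows of labels with identical shape.
    """
    total = 0
    correct = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            if p == t:
                correct += 1
    if total == 0:
        raise ValueError("label maps are empty")
    return correct / total


predicted = [["text", "text"], ["non-text", "text"]]
truth = [["text", "non-text"], ["non-text", "text"]]
print(pixel_accuracy(predicted, truth))  # 0.75
```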

Cited by 20 publications

References 8 publications

Document Content Extraction Using Automatically Discovered Features
