Abstract-This paper evaluates the accuracy of optical character recognition (OCR) systems on real scanned books. The ground-truth e-texts are obtained from the Project Gutenberg website and aligned with the corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is applied recursively to each text segment between matching unique words until the segments become very small. In the final stage, an edit-distance-based alignment algorithm aligns these short chunks of text to produce the final alignment. The proposed approach effectively decomposes the alignment problem into small subproblems, which yields dramatic time savings even when large pieces of text are inserted or deleted and the OCR accuracy is poor. The approach is used to evaluate the OCR accuracy of real scanned books in English, French, German, and Spanish.
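The recursive anchoring idea can be sketched as follows. This is a minimal illustration, not the RETAS implementation: the helper names (`unique_words`, `retas_align`) are invented, the base case pairs tokens by position as a stand-in for the paper's edit-distance alignment, and anchors whose order disagrees between the two texts are simply dropped.

```python
from collections import Counter

def unique_words(tokens):
    """Words that appear exactly once in the token list."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c == 1}

def retas_align(gt, ocr, min_len=3):
    """Recursively align two token lists on words unique to both.

    Returns a list of (gt_word, ocr_word) pairs. Short segments fall
    back to positional pairing (edit-distance alignment in the paper).
    """
    if len(gt) <= min_len or len(ocr) <= min_len:
        return list(zip(gt, ocr))  # fallback; zip truncates unequal tails
    anchors = unique_words(gt) & unique_words(ocr)
    gt_pos = {w: i for i, w in enumerate(gt) if w in anchors}
    ocr_pos = {w: i for i, w in enumerate(ocr) if w in anchors}
    # Keep anchors, in gt order, whose OCR positions are also increasing.
    ordered, last = [], -1
    for w in sorted(anchors, key=gt_pos.get):
        if ocr_pos[w] > last:
            ordered.append(w)
            last = ocr_pos[w]
    if not ordered:
        return list(zip(gt, ocr))
    pairs, gi, oi = [], 0, 0
    for w in ordered:
        # Recurse on the segment between the previous anchor and this one.
        pairs += retas_align(gt[gi:gt_pos[w]], ocr[oi:ocr_pos[w]], min_len)
        pairs.append((w, w))
        gi, oi = gt_pos[w] + 1, ocr_pos[w] + 1
    pairs += retas_align(gt[gi:], ocr[oi:], min_len)
    return pairs
```

Because the shared unique words split the texts into independent segments, each recursion level works on much shorter sequences, which is where the speedup over a single global alignment comes from.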
Abstract-An efficient word spotting framework is proposed for searching text in scanned books. The proposed method allows one to search for words when optical character recognition (OCR) fails due to noise or for languages with no OCR support. Given a query word image, the aim is to retrieve the matching word images in the book, ranked by similarity. In the offline stage, SIFT descriptors are extracted at the corner points of each word image. These features are quantized into visual terms (visterms) using a hierarchical K-means algorithm and indexed with an inverted file. In the query resolution stage, candidate matches are efficiently identified using the inverted index. These word images are then forwarded to the next stage, where the configuration of visterms on the image plane is verified. Configuration matching is performed efficiently by projecting the visterms onto the horizontal axis and finding the longest common subsequence (LCS) between the visterm sequences. The proposed framework is tested on one English and two Telugu books. The method resolves a typical user query in under 10 milliseconds while providing very high retrieval accuracy (mean average precision of 0.93). The search accuracy for the English book is comparable to searching the high-accuracy text output of a commercial OCR engine.
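The configuration-matching step, projecting visterms onto the horizontal axis and comparing the resulting sequences by LCS, can be sketched as below. The function names are hypothetical, and a visterm is represented here as an integer ID paired with the x-coordinate of its keypoint.

```python
def project(visterms):
    """Order visterms by the x-coordinate of their keypoints,
    i.e. project them onto the horizontal axis."""
    return [vid for x, vid in sorted(visterms)]

def lcs_length(a, b):
    """Length of the longest common subsequence of two visterm sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def spatial_score(query_seq, candidate_seq):
    """Configuration similarity: LCS length normalized by query length."""
    return lcs_length(query_seq, candidate_seq) / max(len(query_seq), 1)
```

Reducing the 2-D layout to a 1-D ordering makes the spatial check an O(mn) dynamic program per candidate, which is cheap because the inverted index has already pruned the candidate list.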
Abstract-Spectral information alone is often not sufficient to distinguish certain terrain classes, such as permanent crops like orchards, vineyards, and olive groves, from other types of vegetation. However, instances of these classes possess distinctive spatial structures that are observable in detail in very high spatial resolution images. This paper proposes a novel unsupervised algorithm for the detection and segmentation of orchards. The detection step uses a texture model based on the idea that textures are composed of primitives (trees) appearing in a near-regular repetitive arrangement (planting patterns). The algorithm starts by enhancing potential tree locations using multi-granularity isotropic filters. Then, the regularity of the planting patterns is quantified using projection profiles of the filter responses at multiple orientations. The result is a regularity score at each pixel for each granularity and orientation. Finally, the segmentation step iteratively merges neighboring pixels and regions belonging to similar planting patterns according to the similarities of their regularity scores, and obtains the boundaries of individual orchards along with estimates of their granularities and orientations. Extensive experiments using Ikonos and QuickBird imagery, as well as images taken from Google Earth, show that the proposed algorithm provides good localization of the target objects even when no sharp boundaries exist in the image data.
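One way to score the regularity of a single projection profile is sketched below. The abstract does not define the score, so this assumes, purely for illustration, that a near-regular planting pattern yields a periodic profile, and measures how concentrated the profile's spectrum is at one dominant frequency; the paper's actual score may be defined differently.

```python
import numpy as np

def profile_regularity(profile):
    """Periodicity score for a 1-D projection profile: the fraction of
    non-DC spectral energy (in magnitude) carried by the single
    strongest frequency. Near 1 for a periodic profile, lower for
    aperiodic ones, 0 for a flat profile."""
    profile = np.asarray(profile, dtype=float)
    spectrum = np.abs(np.fft.rfft(profile - profile.mean()))
    total = spectrum[1:].sum()  # skip the DC bin
    if total == 0:
        return 0.0
    return float(spectrum[1:].max() / total)
```

In the algorithm described above, such a score would be computed per pixel neighborhood for each candidate granularity and orientation, and the segmentation step would then merge regions whose score vectors are similar.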
Abstract-This paper evaluates an automated scheme for aligning and combining optical character recognition (OCR) output from three scans of a book to generate a composite version with fewer OCR errors. While there has been some previous work on aligning multiple OCR versions of the same scan, the scheme introduced in this paper does not require that the scans come from the same copy of the book, or even the same edition. The three OCR outputs are combined using an algorithm that builds upon a technique for aligning two sequences at a time. The algorithm generates a multiple sequence alignment of the scans by stitching together pairwise alignments, and uses it in turn to construct a corrected text. The algorithm can remove an OCR error as long as the same error does not occur in multiple scans. The alignment works even if one of the editions includes a long extra introduction or additional footnotes. The scheme is used to generate improved versions of OCR texts taken from the Internet Archive. The accuracy of the original scans and of the composite text is evaluated by comparing them to the versions available from Project Gutenberg.
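A toy version of the combination step can be sketched as a star alignment: pick one scan as the pivot, align the other two to it pairwise, and take a majority vote at each pivot position. The function names are invented, Python's `difflib.SequenceMatcher` stands in for whatever pairwise aligner the paper uses, and only equal-length replacements are paired, so this sketch ignores the insertion/deletion handling needed for extra introductions or footnotes.

```python
from collections import Counter
from difflib import SequenceMatcher

def align_to_pivot(pivot, other):
    """Map pivot token indices to the corresponding tokens in `other`,
    pairing equal runs and same-length replacement runs positionally."""
    mapping = {}
    sm = SequenceMatcher(a=pivot, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == 'equal' or (tag == 'replace' and i2 - i1 == j2 - j1):
            for k in range(i2 - i1):
                mapping[i1 + k] = other[j1 + k]
    return mapping

def combine_scans(scan1, scan2, scan3):
    """Star multiple-alignment sketch: scan1 is the pivot; emit the
    majority token at each pivot position (ties go to the pivot)."""
    m2 = align_to_pivot(scan1, scan2)
    m3 = align_to_pivot(scan1, scan3)
    out = []
    for i, tok in enumerate(scan1):
        # Unaligned positions default to the pivot's token.
        votes = Counter([tok, m2.get(i, tok), m3.get(i, tok)])
        out.append(votes.most_common(1)[0][0])
    return out
```

The voting illustrates why the scheme removes an error only when it does not recur across scans: a character error confined to one scan is outvoted two to one, while an error shared by two scans wins the vote.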