This article proposes an approach to predict the result of binarization algorithms on a given document image according to its state of degradation. Indeed, historical documents suffer from different types of degradation which result in binarization errors. We intend to characterize the degradation of a document image by using different features based on the intensity, quantity and location of the degradation. These features allow us to build prediction models of binarization algorithms that are very accurate according to R² values and p-values. The prediction models are used to select the best binarization algorithm for a given document image. Obviously, this image-by-image strategy improves the binarization of the entire dataset.
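The image-by-image selection described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the form of the prediction models, so ordinary least-squares linear regression is assumed here, and the function names, the algorithm names `algo_a` and `algo_b`, and the feature layout (one row per image, one column per degradation feature) are all hypothetical.

```python
import numpy as np

def fit_linear(features, scores):
    """Fit score ~ intercept + features by least squares (assumed model form)."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef

def predict(coef, features):
    """Predicted binarization score for each row of degradation features."""
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef

def r_squared(scores, predicted):
    """Coefficient of determination of a fitted model."""
    ss_res = np.sum((scores - predicted) ** 2)
    ss_tot = np.sum((scores - np.mean(scores)) ** 2)
    return 1.0 - ss_res / ss_tot

def select_best(models, features):
    """For each image, pick the algorithm whose model predicts the highest score."""
    names = list(models)
    preds = np.column_stack([predict(models[n], features) for n in names])
    return [names[k] for k in np.argmax(preds, axis=1)]
```

One model is fitted per binarization algorithm on images with known ground truth; for a new image, only its degradation features are needed to choose an algorithm, so no candidate binarization has to be run before the choice is made.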
This article presents a way to evaluate the bleed-through defect in very old document images. We design measures to quantify and evaluate the verso ink bleeding through the paper onto the recto side. Measuring the bleed-through defect allows us to perform statistical analyses that predict the feasibility of different post-scan tasks. We illustrate our measures by building two models that predict the OCR error rate from the bleed-through evaluation: one for Abbyy FineReader, a powerful commercial OCR engine, and one for OCRopus, which is sponsored by Google. Both prediction models appear to be very accurate according to various statistical indicators.
Recto-verso registration is an important step that allows the detection of missing digitized pages or the location of the bleed-through defect on a page. An efficient way to restore or evaluate the bleed-through of a digitized document is to analyze the recto side and the verso side at the same time. This requires the two images to be aligned, that is, registered. Without particular knowledge about the document, recto-verso registration is complex: the only information available to register the two sides is the bleed-through, and the recto's bleed-through is a highly degraded version of the verso's ink pixels. Therefore, in this particular context, usual image comparison methods [1] are not very relevant. Document recto-verso registration algorithms have nevertheless been proposed [2], [3], [4], but these methods are computationally expensive, are sensitive to noise, and even fail in some cases where the bleed-through is too light. These techniques are based on a pixel-to-pixel approach in which the bleed-through is considered to be just a set of grey pixels. In this article, we instead consider the structure of the ink pixels on the verso page. The recto-verso registration method presented here is based on the fact that the bleed-through has the same structure as the ink on the verso side. The method registers the recto's bleed-through layout with the verso's ink layout in two main steps: first, a de-skewing algorithm is applied to both pages; then, horizontal and vertical profiles are extracted and aligned with dynamic time warping. The time complexity of our method is linear in the image size. Moreover, the experiments detailed at the end of the article show the accuracy of our method.
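The second step of the method (profile extraction followed by dynamic time warping) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes both pages are already de-skewed and reduced to boolean masks (bleed-through pixels on the recto, ink pixels on the verso), it assumes the verso scan must be mirrored horizontally before comparison, it estimates only a translation, and it uses the textbook quadratic DTW recurrence without attempting to reproduce the cost analysis stated in the abstract. All function names are hypothetical.

```python
import numpy as np

def profiles(mask):
    """Return (horizontal, vertical) projection profiles of a 2-D boolean mask."""
    mask = np.asarray(mask, dtype=float)
    return mask.sum(axis=1), mask.sum(axis=0)  # one value per row, one per column

def dtw_path(a, b):
    """Textbook dynamic time warping; returns the warping path as (i, j) pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end, preferring the diagonal move on ties.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def estimate_shift(ref, moving):
    """Median offset i - j over path pairs where both profiles contain ink."""
    offsets = [i - j for i, j in dtw_path(ref, moving) if ref[i] > 0 and moving[j] > 0]
    return int(np.median(offsets)) if offsets else 0

def register_offsets(recto_bleed, verso_ink):
    """Estimate the (row, column) shift that moves the mirrored verso onto the recto."""
    verso_mirrored = np.fliplr(verso_ink)  # assumption: verso is seen mirrored from the recto
    recto_h, recto_v = profiles(recto_bleed)
    verso_h, verso_v = profiles(verso_mirrored)
    return estimate_shift(recto_h, verso_h), estimate_shift(recto_v, verso_v)
```

Working on two one-dimensional profiles per page rather than on the full pixel grid is what lets the alignment depend on the layout of the ink instead of on individual grey pixels.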