Digital collections are increasingly used for a variety of purposes. In Europe only, we can conservatively estimate that tens of thousands of users consult digital libraries daily. e usages are o en motivated by qualitative and quantitative research. However, caution must be advised as most digitized documents are indexed through their OCRed version, which is far from perfect, especially for ancient documents. In this paper, we aim to estimate the impact of OCR errors on the use of a major online platform: e Gallica digital library from the National Library of France. It accounts for more than 100M OCRed documents and receives 80M search queries every year. In this context, we introduce two main contributions. First, an original corpus of OCRed documents composed of 12M characters along with the corresponding gold standard is presented and provided, with an equal share of English-and French-wri en documents. Next, statistics on OCR errors have been computed thanks to a novel alignment method introduced in this paper. Making use of all the user queries submi ed to the Gallica portal over 4 months, we take advantage of our error model to propose an indicator for predicting the relative risk that queried terms mismatch targeted resources due to OCR errors, underlining the critical extent to which OCR quality impacts on digital library access.
Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated/compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often leads to small reliably annotated datasets. In order to circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, a multi-platform and open-source software able to create many synthetic image documents with controlled ground truth. DocCreator has been used in various experiments, showing the interest of using such synthetic images to enrich the training stage of DIAR tools.
The widespread deployment of facial recognition-based biometric systems has made facial presentation attack detection (face anti-spoofing) an increasingly critical issue. This survey thoroughly investigates facial Presentation Attack Detection (PAD) methods that only require RGB cameras of generic consumer devices over the past two decades. We present an attack scenario-oriented typology of the existing facial PAD methods, and we provide a review of over 50 of the most influenced facial PAD methods over the past two decades till today and their related issues. We adopt a comprehensive presentation of the reviewed facial PAD methods following the proposed typology and in chronological order. By doing so, we depict the main challenges, evolutions and current trends in the field of facial PAD and provide insights on its future research. From an experimental point of view, this survey paper provides a summarized overview of the available public databases and an extensive comparison of the results reported in PAD-reviewed papers.
Digital document categorization based on logo spotting and recognition has raised a great interest in the research community because logos in documents are sources of information for categorizing documents with low costs. In this paper, we present an approach to improve the result of our method for logo spotting and recognition based on keypoint matching and presented in our previous paper [7]. First, the keypoints from both the query document images and a given set of logos (logo gallery) are extracted and described by SIFT, and are matched in the SIFT feature space. Secondly, logo segmentation is performed using spatial density-based clustering. The contribution of this paper is to add a third step where homography is used to filter the matched keypoints as a postprocessing. And finally, in the decision stage, logo classification is performed by using an accumulating histogram. Our approach is tested using a well-known benchmark database of real world documents containing logos, and achieves good performances compared to state-of-the-art approaches.
This article presents a method for generating semi-synthetic images of old documents where the pages might be torn (not flat). By using only 2D deformation models, most existing methods give non-realistic synthetic document images. Thus, we propose to use 3D approach for reproducing geometric distortions in real documents. First, a new proposed texture coordinate generation technique extracts texture coordinates of each vertex in the document shape (mesh) resulting from 3D scanning of a real degraded document. Then, any 2D document image can be overlayed on the mesh by using an existing texture image mapping method. As a result, many complex real geometric distortions can be integrated in generated synthetic images. These images then can be used for enriching training sets or for performance evaluation. The degradation method here is jointly used with the character degradation model we proposed in [1] to generate the 6000 semi-synthetic degraded images of the music score removal staff line competition of ICDAR 2013 1 .
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.