Importance of Textlines in Historical Document Classification

Kišš, Martin; Kohút, Jan; Beneš, Karel; Hradiš, Michal

doi:10.1007/978-3-031-06555-2_11

Cited by 6 publications

(2 citation statements)

References 8 publications


“…Page classification. Page classification of historical documents usually consists in associating each page with a class that describes the period of the document, its place of origin, the script used or the author of the document [39,24]. In our case, the classification task would help to discard pages without any act (cover, blank page, index.…”

Section: Step-by-step Workflow (mentioning)

confidence: 99%

Large Scale Genealogical Information Extraction From Handwritten Quebec Parish Records

Tarride, Maarand, Boillet et al. 2022

Preprint


This paper presents a complete workflow designed for extracting information from Quebec handwritten parish registers. The acts in these documents contain individual and family information highly valuable for genetic, demographic and social studies of the Quebec population. From an image of parish records, our workflow is able to identify the acts and extract personal information. The workflow is divided into successive steps: page classification, text line detection, handwritten text recognition, named entity recognition and act detection and classification. For all these steps, different machine learning models are compared. Once the information is extracted, validation rules designed by experts are then applied to standardize the extracted information and ensure its consistency with the type of act (birth, marriage, and death). This validation step is able to reject records that are considered invalid or merged. The full workflow has been used to process over two million pages of Quebec parish registers from the 19-20th centuries. On a sample comprising 65% of registers, 3.2 million acts were recognized. Verification of the birth and death acts from this sample shows that 74% of them are considered complete and valid. These records will be integrated into the BALSAC database and linked together to recreate family and genealogical relations at large scale.


“…Although Tesseract's accuracy varies across different datasets, the accuracy of the OCR engine can be significantly improved through image preprocessing techniques [5][6][7]. For instance, studies have shown that Tesseract OCR achieves an F1 score of 0.163 on the Brno Mobile OCR Dataset [8], but through pre-processing, the F1 score can increase up to 0.729 [9]. To evaluate the impact of pre-processing on Tesseract's accuracy, we conducted a preliminary analysis using 560 images of phone screen menus captured in an indoor setup with a mounted camera.…”

Section: Introduction (mentioning)

confidence: 99%

GPU-based and streaming-enabled implementation of pre-processing flow towards enhancing optical character recognition accuracy and efficiency

Serhan, Parker, Dhruv et al. 2023

Cluster Comput


Research has demonstrated that digital images can be pre-processed through operations such as scaling, rotation, and blurring to enhance the accuracy of optical character recognition (OCR) by emphasizing important features within the image. Our study employed the open-source Tesseract OCR and found that accuracy can be improved through pre-processing techniques including thresholding, rotation, rescaling, erosion, dilation, and noise removal, based on a dataset of 560 phone screen images. However, our CPU-based implementation of this process resulted in an average latency of 48.32 ms per image, which can hinder the processing of millions of images using OCR. To address this challenge, we parallelized the pre-processing flow on the Nvidia P100 GPU and executed it through a streaming approach, which reduced the latency to 0.825 ms and achieved a speedup factor of 58.6x compared to the serial execution. This implementation enables the use of a GPU-based OCR engine to handle multiple sources of data streams with large-scale workloads.
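The latency figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration and not code from the cited paper: the function names and the workload of one million images are hypothetical, and it assumes the per-image latencies scale linearly with the number of images.

```python
# Back-of-the-envelope check of the latency figures quoted in the abstract.
# Function names and the one-million-image workload are hypothetical.

CPU_LATENCY_MS = 48.32   # average CPU pre-processing latency per image (from the abstract)
GPU_LATENCY_MS = 0.825   # streamed GPU latency per image (from the abstract)
N_IMAGES = 1_000_000     # assumed workload size, for illustration only


def speedup(serial_ms: float, parallel_ms: float) -> float:
    """Ratio of serial latency to parallel latency."""
    return serial_ms / parallel_ms


def total_seconds(latency_ms: float, n_images: int) -> float:
    """Total pre-processing time in seconds, assuming linear scaling."""
    return latency_ms * n_images / 1000.0


print(f"speedup: {speedup(CPU_LATENCY_MS, GPU_LATENCY_MS):.1f}x")
print(f"CPU total: {total_seconds(CPU_LATENCY_MS, N_IMAGES) / 3600:.2f} hours")
print(f"GPU total: {total_seconds(GPU_LATENCY_MS, N_IMAGES) / 60:.2f} minutes")
```

Under these assumptions, the ratio of the two latencies reproduces the reported speedup of about 58.6x, and a workload of one million images drops from roughly 13.4 hours of pre-processing to under 14 minutes.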


A survey of historical document image datasets

Nikolaidou, Seuret, Mokayed et al. 2022

IJDAR


This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap, presenting a meta-study on existing datasets. After a systematic selection process (according to PRISMA guidelines), we select 65 studies that are chosen based on different factors, such as the year of publication, number of methods implemented in the article, reliability of the chosen algorithms, dataset size, and journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or content analysis. We present the statistics, document type, language, tasks, input visual aspects, and ground truth information for every dataset. In addition, we provide the benchmark tasks and results from these papers or recent competitions. We further discuss gaps and challenges in this domain. We advocate for providing conversion tools to common formats (e.g., COCO format for computer vision tasks) and always providing a set of evaluation metrics, instead of just one, to make results comparable across studies.
