OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank

doi:10.20944/preprints201909.0101.v1

Cited by 10 publications

(8 citation statements)

References 17 publications


“…Not only the text recognition task, but also the segmentation workflows needed to be covered, preferably in one single software package. OCR4All (Reul et al, 2019) was the chosen software package, including the training tasks of HTR models for the Spanish language. This software contains several modules for the image pre-processing phases (binarization, noise removal and paragraph and line segmentation), and also has a complete infrastructure for training, evaluation and inference of text recognition models based on CNN-LSTM architectures.…”

Section: OCR4all Platform (mentioning)

confidence: 99%

End-to-end platform evaluation for Spanish Handwritten Text Recognition

Xamena, Barboza, Orozco

2021

CyT


The task of automated recognition of handwritten texts requires various phases and technologies both optical and language related. This article describes an approach for performing this task in a comprehensive manner, using machine learning throughout all phases of the process. In addition to the explanation of the employed methodology, it describes the process of building and evaluating a model of manuscript recognition for the Spanish language. The original contribution of this article is given by the training and evaluation of Offline HTR models for Spanish language manuscripts, as well as the evaluation of a platform to perform this task in a complete way. In addition, it details the work being carried out to achieve improvements in the models obtained, and to develop new models for different complex corpora that are more difficult for the HTR task.


“…First, Lace 6 is a GUI-based platform for the visualization and manual correction of OCR output, designed for automatic conversion to TEI-encoded XML. Second, OCR4all [14] is an open source OCR tool explicitly developed for users with no prior technical background, and especially those working on the earliest printed books. It implements an iterative workflow that allows for rapidly training very accurate OCR models for specific publications or publication series.…”

Section: Related Work (mentioning)

confidence: 99%

“…In the second workflow, images were segmented and pre-processed before being passed to Tesseract. However, since Tesseract could not be prevented from resegmenting words in its own fashion 14 , this workflow performed poorly (results not reported).…”

Section: Comparing OCR Pipelines, 5.2.1 Tesseract/OCR-D (mentioning)

confidence: 99%

Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs

Romanello, Najem-Meyer, Robertson

2021

Preprint


Together with critical editions and translations, commentaries are one of the main genres of publication in literary and textual scholarship, and have a century-long tradition. Yet, the exploitation of thousands of digitized historical commentaries was hitherto hindered by the poor quality of Optical Character Recognition (OCR), especially on commentaries to Greek texts. In this paper, we evaluate the performances of two pipelines suitable for the OCR of historical classical commentaries. Our results show that Kraken + Ciaconna reaches a substantially lower character error rate (CER) than Tesseract/OCR-D on commentary sections with high density of polytonic Greek text (average CER 7% vs. 13%), while Tesseract/OCR-D is slightly more accurate than Kraken + Ciaconna on text sections written predominantly in Latin script (average CER 8.2% vs. 8.4%). As part of this paper, we also release GT4HistComment, a small dataset with OCR ground truth for 19th-century classical commentaries, and Pogretra, a large collection of training data and pre-trained models for a wide variety of ancient Greek typefaces.


“…In Germany, there is the OCR-D 4 coordination project, with 8 project modules focused on various stages of OCR. Also in Germany, OCR4all [28] has recently published an open-source tool providing a (semi-)automatic OCR Workflow for historical prints. The workflow was created using different open-source tools.…”

Section: Related Work (mentioning)

confidence: 99%

Optical character recognition with neural networks and post-correction with finite state methods

Drobac, Lindén

2020

IJDAR


The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed language model for the entire corpus and finding a voting setup that further improves the results.
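The abstracts above report results as character error rate (CER). As a rough illustration only, CER is commonly computed as the Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. The sketch below assumes that definition; the function names are hypothetical, and the cited papers may normalize or preprocess text differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, one wrong character in a four-character ground-truth line gives a CER of 0.25, so the 1.7% figure quoted above would correspond to roughly 17 character errors per 1,000 ground-truth characters.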
