2019
DOI: 10.20944/preprints201909.0101.v1
Preprint

OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

Abstract: Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. The drawback of these tools often is their limited applicability by non-technical users like humanist scholars a…

Cited by 10 publications (8 citation statements)
References 17 publications

“…Not only the text recognition task, but also the segmentation workflows needed to be covered, preferably in one single software package. OCR4All (Reul et al, 2019) was the chosen software package, including the training tasks of HTR models for the Spanish language. This software contains several modules for the image pre-processing phases (binarization, noise removal and paragraph and line segmentation), and also has a complete infrastructure for training, evaluation and inference of text recognition models based on CNN-LSTM architectures.…”
Section: OCR4all Platform (mentioning)
confidence: 99%
“…First, Lace is a GUI-based platform for the visualization and manual correction of OCR output, designed for automatic conversion to TEI-encoded XML. Second, OCR4all [14] is an open source OCR tool explicitly developed for users with no prior technical background, and especially those working on the earliest printed books. It implements an iterative workflow that allows for rapidly training very accurate OCR models for specific publications or publication series.…”
Section: Related Work (mentioning)
confidence: 99%
“…In the second workflow, images were segmented and pre-processed before being passed to Tesseract. However, since Tesseract could not be prevented from resegmenting words in its own fashion 14 , this workflow performed poorly (results not reported).…”
Section: Comparing OCR Pipelines 521 Tesseract/OCR-D (mentioning)
confidence: 99%
“…In Germany, there is the OCR-D coordination project, with 8 project modules focused on various stages of OCR. Also in Germany, OCR4all [28] has recently published an open-source tool providing a (semi-)automatic OCR Workflow for historical prints. The workflow was created using different open-source tools.…”
Section: Related Work (mentioning)
confidence: 99%