PoCoTo - an open source system for efficient interactive postcorrection of OCRed historical texts

Vobl, Thorsten; Gotscharek, Annette; Reffle, Uli; Ringlstetter, Christoph; Schulz, Klaus U.

doi:10.1145/2595188.2595197

Cited by 24 publications

(21 citation statements)

References 11 publications

“…In the context of historical OCR the interactive postcorrection tool PoCoTo²⁷ represents the state-of-the-art. The original PoCoTo introduced by Vobl et al [39] is a system developed to support the efficient interactive postcorrection of historical texts by offering several advanced features: Suspicious tokens of the OCR text are identified by a special language technology which is aware of historical language variations represented by rewrite rules like t → th (modern spelling vs. historical spelling) and can be corrected by choosing a word from a list of generated plausible correction candidates. The user does not have to perform this for every single word but can batch correct entire error series which for example can consist of identically misrecognized words or words that suffer from the same OCR error, for example the confusion of "e" and "c".…”

Section: Pocoto (mentioning)

confidence: 99%

OCR4all—An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

et al. 2019

Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. The drawback of these tools often is their limited applicability by non-technical users like humanist scholars and in particular the combined use of several tools in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output, but already in early stages to minimize error propagations. Further on, extensive configuration capabilities are provided to set the degree of automation of the workflow and to make adaptations to the carefully selected default parameters for specific printings, if necessary. Experiments showed that users with minimal or no experience were able to capture the text of even the earliest printed books with manageable effort and great quality, achieving excellent character error rates (CERs) below 0.5%. The fully automated application on 19th century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY Finereader on moderate layouts if suitably pretrained mixed OCR models are available. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components by standardized interfaces like PageXML, thus aiming at continual higher automation for historical printings.

“…Visual support of the post-correction process has been emphasized by e.g. Vobl et al (2014) who describe a system of iterative post-correction of OCRed historical text which is evaluated in an application-oriented way. They present the human corrector with an alignment of image and OCRed text and make batch correction of the same error in the entire document possible.…”

Section: Related Work (mentioning)

confidence: 99%

Multi-modular domain-tailored OCR post-correction

Schulz, Kuhn

2017

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One of the main obstacles for many Digital Humanities projects is the low data availability. Texts have to be digitized in an expensive and time consuming process whereas Optical Character Recognition (OCR) post-correction is one of the time-critical factors. At the example of OCR post-correction, we show the adaptation of a generic system to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of literature. We show that the combination of different approaches, such as e.g. Statistical Machine Translation and spell checking, with the help of a ranking mechanism tremendously improves over singlehanded approaches. Since we consider the accessibility of the resulting tool as a crucial part of Digital Humanities collaborations, we describe the workflow we suggest for efficient text recognition and subsequent automatic and manual postcorrection.

“…The basic list, which goes back to the IMPACT project, contains the most frequent patterns such as s:ſ, u:v, consonant doublings such as n:nn etc. The extended list was built by looking at previous profiler output in the context of our postcorrection tool PoCoTo⁵ [6], when apparent prominent OCR error patterns turned out to actually represent an additional historical pattern. In this way we found historical spelling patterns such as ß: (see Fig.…”

²http://reader.digitale-sammlungen.de/de/fs1/object/display/bsb11106588_00064.html
³http://reader.digitale-sammlungen.de/de/fs1/object/display/bsb10727266_00071.html
⁴Lüdeling, Anke; Odebrecht, Carolin; Zeldes, Amir; RIDGES-Herbology (Version 5.0), Humboldt-Universität zu Berlin, https://www.linguistik.hu-berlin.de/en/instituten/professuren-en/korpuslinguistik/research/ridges-projekt?set_language=en
⁵https://github.com/cisocrgroup/PoCoTo

Section: Evaluation Data and Principles (mentioning)

confidence: 99%

Profiling of OCR'ed Historical Texts Revisited

Fink, Schulz, Springmann

2017

Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage

Self Cite

In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a statistical profile available that provides an estimate of error classes with associated frequencies, and that points to conjectured errors and suspicious tokens. The method introduced in [3] computes such a profile, combining lexica, pattern sets and advanced matching techniques in a specialized Expectation Maximization (EM) procedure. Here we improve this method in three respects: First, the method in [3] is not adaptive: user feedback obtained by actual postcorrection steps cannot be used to compute refined profiles. We introduce a variant of the method that is open for adaptivity, taking correction steps of the user into account. This leads to higher precision with respect to recognition of erroneous OCR tokens. Second, during postcorrection often new historical patterns are found. We show that adding new historical patterns to the linguistic background resources leads to a second kind of improvement, enabling even higher precision by telling historical spellings apart from OCR errors. Third, the method in [3] does not make any active use of tokens that cannot be interpreted in the underlying channel model. We show that adding these uninterpretable tokens to the set of conjectured errors leads to a significant improvement of the recall for error detection, at the same time improving precision.
