Multi-modular domain-tailored OCR post-correction

2019 International Conference on Document Analysis and Recognition (ICDAR)

Jatowt, Coustaty, et al. 2019

The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents are indexed, accessed and exploited. Post-processing approaches detect and correct remaining errors to improve the quality of OCR texts. However, state-of-the-art approaches still need to be improved. Most existing post-OCR techniques use predefined error position lists or apply simple techniques to detect errors. In this paper, we describe a novel error detector that uses features ranging from the character level (including character noisy channel and index of peculiarity) to the word level (such as frequencies of n-grams, skip-grams and part-of-speech). Experimental results show that our approach outperforms the best-performing techniques in the ICDAR 2017 Competition on Post-OCR Text Correction.

Section: Mixed Error Detection

mentioning

confidence: 99%

Post-OCR Error Detection by Generating Plausible Candidates

2019 International Conference on Document Analysis and Recognition (ICDAR)

Jatowt, Coustaty, et al. 2019

“…In contrast, our model is easy to implement with available data. Moreover, it should be emphasized that our improvement is much higher than the neural MT based approach (CLAM) or statistical MT based one (MMDT) [16]. Consequently, we think that our model can be considered as a reliable solution to reduce OCR errors across various data sets.…”

Section: Monograph Periodical

mentioning

confidence: 85%

“…Others (e.g. Char-SMT/NMT [1], MMDT [16], CLAM, CCC, UVA -competition teams [3,14]) use machine translation techniques in order to transform OCRed text into corrected one.…”

Section: Related Work

mentioning

confidence: 99%

Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020

Jatowt et al. 2020

The quality of OCR has a direct impact on information access, and an indirect impact on the performance of natural language processing applications, making fine-grained (e.g., semantic) information access even harder. This work proposes a novel post-OCR approach based on a contextual language model and neural machine translation, aiming to improve the quality of OCRed text by detecting and rectifying erroneous tokens. This new technique obtains results comparable to the best-performing approaches on English datasets of the competition on post-OCR text correction in ICDAR 2017/2019.

“…MMDT [18] approach combined many modules from word level (Original words, Spell checker, Compounder, Word splitter, Text-Internal Vocabulary) to sentence level (Statistical Machine Translation) for candidate suggestion. Then, the decision module of Moses decoder [13] was used to rank candidates.…”

Section: Related Work

mentioning

confidence: 99%

Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

Lecture Notes in Computer Science

Coustaty, Doucet, et al. 2018

Post-processing is a crucial step in improving the performance of the OCR process. In this paper, we present a novel approach that explores a modified way of generating and scoring candidates at the character level as well as the word level. These features are combined with some important features suggested by related work to rank candidates in a regression model. The experimental results show that our approach achieves results comparable to the top-performing approaches in the ICDAR 2017 competition on post-OCR text correction.
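
As a rough illustration of the general idea of generating candidates by edit distance and then ranking them, the sketch below may help. It is not the authors' method: the abstract describes an adaptive edit distance and a learned regression model, while this sketch uses plain Levenshtein distance and a hand-set score. The names `edit_distance`, `rank_candidates`, `LEXICON` and the parameter `max_distance` are all hypothetical, and the word counts are made up for the example.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_candidates(token: str,
                    lexicon: Dict[str, int],
                    max_distance: int = 2) -> List[Tuple[str, float]]:
    """Return lexicon words within max_distance of token, best score first."""
    total = sum(lexicon.values())
    scored = []
    for word, count in lexicon.items():
        d = edit_distance(token, word)
        if d <= max_distance:
            # Assumption: a hand-set score stands in for a learned regression.
            # Closer candidates and more frequent words score higher.
            score = (1.0 / (1 + d)) * (count / total)
            scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy lexicon with invented counts, for illustration only.
LEXICON = {"the": 500, "tie": 20, "then": 80, "correction": 15}

if __name__ == "__main__":
    for word, score in rank_candidates("tbe", LEXICON):
        print(f"{word}\t{score:.4f}")
```

In a regression-based ranker, the distance and frequency values computed here would typically be passed as features to a trained model rather than multiplied together by hand.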