Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017)
DOI: 10.18653/v1/d17-1288

Multi-modular domain-tailored OCR post-correction

Abstract: One of the main obstacles for many Digital Humanities projects is low data availability. Texts have to be digitized in an expensive and time-consuming process, in which Optical Character Recognition (OCR) post-correction is one of the time-critical factors. Using OCR post-correction as an example, we show how a generic system can be adapted to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of li…

Citations: cited by 31 publications (33 citation statements)
References: 21 publications
“…Two approaches (CLAM, Char-SMT/NMT) applied MT techniques at character level to detect and correct OCR errors. Another approach MMDT [11] combined many modules for candidate suggestion. Then, the decision module of MT technique was used to rank candidates.…”
Section: Mixed Error Detection (mentioning)
confidence: 99%
“…In contrast, our model is easy to implement with available data. Moreover, it should be emphasized that our improvement is much higher than the neural MT based approach (CLAM) or statistical MT based one (MMDT) [16]. Consequently, we think that our model can be considered as a reliable solution to reduce OCR errors across various data sets.…”
Section: Monograph Periodical (mentioning)
confidence: 85%
“…Others (e.g. Char-SMT/NMT [1], MMDT [16], CLAM, CCC, UVA -competition teams [3,14]) use machine translation techniques in order to transform OCRed text into corrected one.…”
Section: Related Work (mentioning)
confidence: 99%
“…MMDT [18] approach combined many modules from word level (Original words, Spell checker, Compounder, Word splitter, Text-Internal Vocabulary) to sentence level (Statistical Machine Translation) for candidate suggestion. Then, the decision module of Moses decoder [13] was used to rank candidates.…”
Section: Related Work (mentioning)
confidence: 99%
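
The citation statements above describe MMDT as a set of suggestion modules whose candidates are then ranked by a decision module. The following is a minimal Python sketch of that general pattern only, not the authors' implementation: the toy lexicon, the three module functions, and the scoring rule (module votes plus a lexicon bonus) are all assumptions made for illustration. In particular, the simple scoring stands in for the Moses decoder that the citing papers say MMDT uses for ranking.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical toy lexicon; a real system would use a domain-specific vocabulary.
LEXICON = {"the", "in", "tower", "watch", "house"}


def original_word(token: str) -> List[str]:
    """Always propose the unchanged token as a candidate."""
    return [token]


def spell_checker(token: str) -> List[str]:
    """Propose lexicon words that differ from the token by one substitution."""
    return sorted(
        w for w in LEXICON
        if len(w) == len(token) and sum(a != b for a, b in zip(w, token)) == 1
    )


def word_splitter(token: str) -> List[str]:
    """Propose splits of the token into two lexicon words."""
    return [
        f"{token[:i]} {token[i:]}"
        for i in range(1, len(token))
        if token[:i] in LEXICON and token[i:] in LEXICON
    ]


# Illustrative module list; the module names mirror those in the quoted statement.
MODULES: List[Callable[[str], List[str]]] = [original_word, spell_checker, word_splitter]


def rank_candidates(token: str) -> List[str]:
    """Collect candidates from every module and rank them, best first.

    Assumed scoring: one point per proposing module, plus one point if
    every word of the candidate is in the lexicon. Ties break alphabetically.
    """
    votes = Counter()
    for module in MODULES:
        for cand in module(token):
            votes[cand] += 1

    def score(cand: str) -> int:
        in_lexicon = all(part in LEXICON for part in cand.split())
        return votes[cand] + (1 if in_lexicon else 0)

    return sorted(votes, key=lambda c: (-score(c), c))


print(rank_candidates("tbe"))    # ['the', 'tbe']
print(rank_candidates("inthe"))  # ['in the', 'inthe']
```

Keeping every module behind the same `str -> List[str]` signature is what makes the design modular: a compounder or a text-internal vocabulary module could be appended to `MODULES` without touching the ranking step.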