2016
DOI: 10.1007/s11042-016-4185-5
Learning string distance with smoothing for OCR spelling correction

Abstract: Large databases of scanned documents (medical records, legal texts, historical documents) require natural language processing for retrieval and structured information extraction. Errors caused by the optical character recognition (OCR) system increase the ambiguity of the recognized text and decrease the performance of natural language processing. The paper proposes an OCR post-correction system with a parametrized string distance metric. The correction system learns specific error patterns from incorrect words and common sequ…

Cited by 12 publications (7 citation statements). References 38 publications (47 reference statements).

“…In particular, the similarity between crawled pages and the given input text is controlled based on the normalised cosine distance. For each token in the input, they select lexical words whose LV distances to the input are small. Poncelas et al [124], Hládek et al [63], and Généreux et al [52] employ similar techniques to detect errors and suggest correction candidates. Particularly, they detect noisy tokens by a lexicon lookup and select candidates based on LV distances between a given error and lexicon entries.…”
Section: Context-dependent Approaches
confidence: 99%
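The lexicon-lookup detection and LV-distance candidate selection described in this statement can be illustrated with a short sketch. The lexicon, the distance threshold, and the function names below are illustrative assumptions, not details taken from the cited papers.

```python
# A minimal sketch (assumed details): detect noisy tokens by lexicon lookup and
# rank in-lexicon candidates by Levenshtein (LV) distance. The lexicon and the
# max_dist threshold are illustrative, not values from the cited papers.

def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

def detect_and_suggest(tokens, lexicon, max_dist=2):
    """Flag tokens missing from the lexicon and return candidates sorted by
    LV distance to the noisy token."""
    suggestions = {}
    for tok in tokens:
        if tok.lower() in lexicon:
            continue                                       # lexicon hit: keep as-is
        cands = sorted((w for w in lexicon
                        if levenshtein(tok.lower(), w) <= max_dist),
                       key=lambda w: levenshtein(tok.lower(), w))
        suggestions[tok] = cands
    return suggestions

lexicon = {"optical", "character", "recognition", "report"}
print(detect_and_suggest(["0ptical", "character", "rec0gnition"], lexicon))
```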
“…Poncelas et al [124] rank the correction suggestions with a word 5-gram language model built from the Europarl-v9 corpus. Hládek et al [63] use an HMM in which the state transition probability is a word-bigram language-model probability and the observation probability is their smoothed string distance, and choose the best candidate accordingly. Généreux et al [52] choose the most probable candidate by a sum of the following feature values: confusion weight, candidate frequency, and bigram frequency.…”
Section: Context-dependent Approaches
confidence: 99%
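As a rough illustration of the HMM-style scoring mentioned above (a word-bigram transition term combined with a distance-based observation term), here is a toy sketch; the probability table, the 1e-6 floor, and the emission formula are invented placeholders, not the actual model of Hládek et al [63].

```python
import math

# A toy sketch (assumed details) of HMM-style candidate scoring: a word-bigram
# transition term plus an emission term derived from a string distance. The
# probability table, the 1e-6 floor, and the emission formula are invented
# placeholders, not the actual model of Hládek et al [63].

bigram_logprob = {                      # log P(word | previous word), illustrative
    ("the", "report"): math.log(0.02),
    ("the", "resort"): math.log(0.001),
}

def emission_logprob(observed, candidate, distance):
    """Turn a string distance into a crude log 'observation probability'."""
    return -float(distance(observed, candidate))   # smaller distance -> higher score

def best_candidate(prev_word, observed, candidates, distance):
    """One Viterbi-like step: argmax of transition + emission log scores."""
    return max(candidates,
               key=lambda c: bigram_logprob.get((prev_word, c), math.log(1e-6))
                             + emission_logprob(observed, c, distance))

# Usage, e.g. with the levenshtein function from the earlier sketch:
# best_candidate("the", "rep0rt", ["report", "resort"], levenshtein)  -> "report"
```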
“…Spelling correction is part of postprocessing of the digitized document because OCR systems are usually proprietary and difficult to adapt. Typical error patterns appear in OCR texts [8]. The standard set for evaluation of an OCR spelling correction system is the TREC-5 Confusion Track [9].…”
Section: Spelling Errors
confidence: 99%
“…If the training corpus is sparse (which it almost always is), the learning process brings the problem of overfitting. Hládek et al [8] proposed a method for smoothing parameters in a letter-confusion matrix. Bilenko and Mooney [149] extended string-distance learning with an affine gap penalty (allowing for random sequences of characters to be skipped).…”
Section: Learning String Metrics
confidence: 99%
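To make the smoothing idea concrete, below is a minimal sketch of additive (add-k) smoothing over a letter-confusion matrix. Plain add-k is used here only as a stand-in for the smoothing method of Hládek et al [8], and the counts and alphabet are made up.

```python
from collections import Counter

# A minimal sketch (assumed details) of add-k smoothing over a letter-confusion
# matrix: substitutions never observed in sparse training data keep nonzero
# probability mass. Plain add-k stands in for the smoothing of Hládek et al [8];
# the counts and alphabet below are made up.

def smoothed_confusion_probs(pair_counts, alphabet, k=0.5):
    """Estimate P(observed char | intended char) with additive smoothing."""
    probs = {}
    for intended in alphabet:
        total = sum(pair_counts[(intended, o)] for o in alphabet)
        for observed in alphabet:
            probs[(intended, observed)] = (
                (pair_counts[(intended, observed)] + k)
                / (total + k * len(alphabet)))
    return probs

counts = Counter({("o", "o"): 90, ("o", "0"): 7, ("l", "l"): 95, ("l", "1"): 4})
probs = smoothed_confusion_probs(counts, alphabet="ol01")
print(round(probs[("o", "0")], 3))   # seen substitution: relatively likely
print(round(probs[("o", "1")], 3))   # unseen substitution: small but nonzero
```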