2007
DOI: 10.1109/icdar.2007.4377099
Context-Sensitive Error Correction: Using Topic Models to Improve OCR

Cited by 22 publications (14 citation statements) | References 7 publications
“…We believe that further improvements can be achieved by using the clean lists in conjunction with more sophisticated models, such as document-specific language models, as suggested by [19]. In addition, we believe that the clean lists can also be used to re-segment and fix the large percentage of initial errors that result from incorrect character segmentation.…”
Section: Results
confidence: 99%
“…The People-LDA model [23] combined a hyper-feature-based face identifier with an LDA model to center topics around people. Wick et al [24] used topic models to automatically detect and represent an article's semantic context for OCR improvement.…”
Section: Related Work
confidence: 99%
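The idea attributed to Wick et al [24] — using a document's topic distribution to pick among OCR word candidates — can be illustrated with a minimal sketch. The topic word distributions, topic mixture, and candidate list below are entirely hypothetical toy data, not the paper's actual model:

```python
# Hypothetical sketch of topic-model-based OCR candidate re-ranking,
# in the spirit of Wick et al [24]: among the OCR engine's candidate
# words, choose the one most probable under the document's inferred
# topic mixture. All distributions here are illustrative toy values.

def topic_score(word, topic_mix, topics):
    """P(word | doc) = sum over topics k of P(k | doc) * P(word | k)."""
    # A small floor probability stands in for proper smoothing.
    return sum(weight * topics[k].get(word, 1e-6)
               for k, weight in topic_mix.items())

def correct(candidates, topic_mix, topics):
    """Return the OCR candidate with the highest topic-model probability."""
    return max(candidates, key=lambda w: topic_score(w, topic_mix, topics))

# Toy topics: a "finance" topic and a "farming" topic.
topics = {
    "finance": {"bank": 0.05, "loan": 0.04, "rate": 0.03},
    "farming": {"barn": 0.05, "crop": 0.04, "soil": 0.03},
}

# A document whose inferred topic mixture is mostly finance.
topic_mix = {"finance": 0.9, "farming": 0.1}

# The OCR engine is unsure whether a smudged word is "bank" or "barn";
# the document's financial context resolves the ambiguity.
print(correct(["bank", "barn"], topic_mix, topics))  # → bank
```

In a real system the topic mixture would come from LDA inference over the OCR output itself, so the context estimate and the corrections can reinforce each other.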
“…In the past, most studies in error detection [2], [3] have focused on English or a few Latin-script languages such as German. In 1992, Kukich [1] performed experimental analysis with merely a few thousand words, while the methods discussed in 2011 by Smith [4] use a corpus as large as 100 billion words.…”
Section: Introduction
confidence: 99%