A filter based post-OCR accuracy boost system

Borovikov, Eugene; Zavorin, Ilya; Turner, Mark

doi:10.1145/1031442.1031446

Cited by 12 publications

(12 citation statements)

References 4 publications

“…They propose an efficient sampling algorithm that accounts for a noisy data and get more representative sample. In [8], they propose a technique that is based on hidden Markov model that aims at minimizing the noise output of the OCR. They measure their solution against a base-line solution that is based solely on spelling checker to remove noise.…”

Section: Related Work (mentioning)

confidence: 99%

Marketing image categorization using hybrid human-machine combinations

Gnanasambandam; Madhu

2012

Imaging and Printing in a Web 2.0 World III

Marketing instruments with nested, short-form, symbol loaded content need to be studied differently. Image classification in the Web2.0 world can dynamically use a configurable amount of internal and external data as well as varying levels of crowd-sourcing. Our work is one such examination of how to construct a hybrid technique involving learning and crowd-sourcing. Through a parameter called turkmix and a multitude of crowd-sourcing techniques available we show that we can control the trend of metrics such as precision and recall on the hybrid categorizer.

“…They propose an efficient sampling algorithm that accounts for a noisy data and get more representative sample. In [7], they propose a technique that is based on hidden markov model that aims at minimizing the noise output of the OCR. They measure their solution against a base-line solution that is based solely on spelling checker to remove noise.…”

Section: Related Work (mentioning)

confidence: 99%

Image categorization for marketing purposes

2011

Images meant for marketing and promotional purposes (i.e. coupons) represent a basic component in incentivizing customers to visit shopping outlets and purchase discounted commodities. They also help department stores attract more customers and, potentially, speed up their cash flow. While coupons are available from various sources (print, web, etc.), categorizing these monetary instruments is a benefit to users. We are interested in an automatic categorizer system that aggregates these coupons from different sources (web, digital coupons, paper coupons, etc.) and efficiently assigns a type to each of them. While there are several dimensions to this problem, in this paper we study the problem of accurately categorizing the coupons. We propose and evaluate four techniques for categorizing the coupons, namely a word-based model, an n-gram-based model, an externally weighing model, and a weight decaying model, all of which take advantage of known machine learning algorithms. These techniques achieve high accuracies in the range of 73.1% to 93.2%. We provide various examples of accuracy optimizations that can be performed and show a progressive increase in categorization accuracy for our test dataset.

“…On the contrary, for lightweight methods, systems use probabilistic techniques and n-gram analysis, classically solved through Hidden Markov Models (HMM) or dynamic programming, first used by Neuhoff [14] in text correction. Borovikov et al [3] have built a HMM-based correction using several post-OCR filters. OCR errors were modeled in terms of a two-layer stochastic process to deal with known and observed characters.…”

Section: State-of-the-art of Natural Scene OCR Correction (mentioning)

confidence: 99%

A Weighted Finite-State Framework for Correcting Errors in Natural Scene OCR

Beaufort; Mancas-Thillou

2007

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2

With the increasing market of cheap cameras, natural scene text has to be handled in an efficient way. Some works deal with text detection in the image while more recent ones point out the challenge of text extraction and recognition. We propose here an OCR correction system to handle traditional issues of recognizer errors but also the ones due to natural scene images, i.e. cut characters, artistic display, incomplete sentences (present in advertisements) and out-of-vocabulary (OOV) words such as acronyms and so on. The main algorithm is based on Finite-State Machines (FSMs) to deal with learned OCR confusions, capital/accented letters and lexicon look-up. Moreover, as OCR is not considered as a black box, several outputs are taken into account to intermingle recognition and correction steps. Based on a public database of natural scene words, detailed results are also presented along with future works.
