A complete OCR for printed Hindi text in Devanagari script

Bansal, Veena; Sinha, R.M.K.

doi:10.1109/icdar.2001.953898

Cited by 53 publications

(21 citation statements)

References 9 publications


“…Bansal and Sinha [2] had proposed a segmentation technique for Devanagari script wherein word image is divided into top, core and bottom strips. Top strip is separated from core strip by a header line.…”

Section: Figure 5: Identification Of Break Locations On Images (mentioning)

confidence: 99%

Towards recognition of degraded words by probabilistic parsing

Mohan, Jinesh, Jawahar

2010

Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing


Although Indian-language OCRs have shown significant improvement in classification rates in recent years, recognition of degraded words still poses a big challenge for the development of robust OCR systems. Ours is an attempt to formulate the problem of degraded word recognition in a generic and formal structure. We formulate the problem of degraded word recognition as a probabilistic parsing problem. A probabilistic-parsing-based framework is used to rank and validate various possible hypotheses. We effectively combine it with an alternate word generator, symbol recognizer and verification unit to improve recognition rates of degraded words without compromising good characters. We demonstrate our method on Malayalam. We test our method on a complete annotated book, where around 65% of the degraded words are correctly recognized using this approach.
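The citation statement for this entry describes a word image being split into strips, with a header line separating the top strip from the core strip. A minimal sketch of that first split is given below, assuming a binarized word image stored as a list of rows (1 = ink, 0 = background). Picking the header line as the row with the most ink is a common heuristic and is an assumption here, not necessarily the method of Bansal and Sinha; the function names are hypothetical, and separating the bottom strip is not attempted.

```python
def find_header_row(word):
    """Return the index of the row containing the most ink pixels.

    Assumption: the header line is the densest row of the word image.
    A real header line is usually several pixels thick; this sketch
    returns a single row index only.
    """
    profile = [sum(row) for row in word]  # horizontal projection profile
    return max(range(len(profile)), key=profile.__getitem__)


def split_at_header(word):
    """Split a word image into the strip above the header line and the rest."""
    header_row = find_header_row(word)
    top = word[:header_row]        # rows above the header line
    body = word[header_row + 1:]   # core strip plus any bottom strip
    return top, header_row, body


# Tiny synthetic word image: row 1 is a full-width header line.
word = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
]
top, header_row, body = split_at_header(word)
```

On scanned pages the densest row can be misleading for skewed or broken words, so a practical system would likely restrict the search to the upper part of the word and handle header lines that span several rows.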


“…Research interest in Latin-based OCR faded away more than a decade ago, in favor of Chinese, Japanese, and Korean (CJK) [1,2], followed more recently by Arabic [3,4], and then Hindi [5,6]. These languages provide greater challenges specifically to classifiers, and also to the other components of OCR systems.…”

Section: Introduction (mentioning)

confidence: 99%

Adapting the Tesseract open source OCR engine for multilingual OCR

Smith, Antonova, Lee

2009

Proceedings of the International Workshop on Multilingual OCR



“…While Tesseract was originally developed for English, it has since been extended to recognize French, Italian, Catalan, Czech, Danish, Polish, Bulgarian, Russian, Greek, Korean, Spanish, Japanese, Dutch, Chinese, Indonesian, Swedish, German, Thai, Arabic, and Hindi etc. Training the Tesseract OCR Engine for Hindi language requires in-depth knowledge of Devnagari script in order to collect the character set [4]. Moreover, Tesseract OCR Engine does not just require training of the collected dataset but also to tackle the character segmentation and clubbing issues based on the script specific features [5] i.e.…”

Section: Introduction (mentioning)

confidence: 99%

Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition

Mishra, Patvardhan, Lakshmi et al.

2012

IJCA


Tesseract OCR Engine is one of the most efficient open source OCR engines currently available. Tesseract OCR 3.01 can recognize Hindi, but it still needs enhancement to improve its performance. Recognition accuracy for Hindi is quite low even for printed text, because the conjunct character combinations of Hindi are not easily separable due to partial overlapping. The proposed approach addresses this problem so that Devanagari conjunct characters can be segmented and recognized using Tesseract OCR Engine. This paper presents a complete methodology to improve Hindi recognition accuracy. It also presents a comparison with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations and database size.

General Terms: Pattern Recognition
