Safeya Mamish scite author profile

Safeya Mamish

1 Publication

0 Citation Statements Received

24 Citation Statements Given

Affiliations

École de Technologie Supérieure

Publications

Correcting Arabic OCR Errors Using Improved Topic-Based Language Models

Mamish

Cheriet

2009

Int. J. Comp. Proc. Lang.

The OCR output of scanned document images suffers from recognition errors, especially for languages with distinctive features and rich morphology such as Arabic, so an effective error correction model is greatly needed. This paper focuses on three aspects of post-processing correction. First, it improves the alignment and error n-gram models by adding correction rules based on character meta-classes rather than on specific characters, which is more suitable for the Arabic language. Second, it uses the language models to understand and correct Arabic word fragments resulting from agglutinated affixes or isolated letters. Third, it improves the language models by adding semantic information to the correction process through bidirectional n-grams, stemming, and stop-word removal, which gives higher weights to n-grams sharing semantic meaning. In addition, a topic corpus is used instead of a global one to obtain a better probability distribution. The proposed model is effective in correcting lexical errors and also covers semantic errors, which are not frequently reported by OCR systems and are corrected after manual proofreading. The proposed method shows an increase in the correction rate of almost 13%, especially for meaningful terms.
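
The idea of ranking correction candidates with bidirectional n-grams over a topic corpus can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the tiny English corpus, the stop-word list, and the add-one smoothing are all assumptions chosen for readability, and stemming is omitted for brevity.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = {"the", "of", "a", "in", "is"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def train_bigrams(corpus):
    """Count unigrams and adjacent-word bigrams over a (topic) corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = content_words(sentence)
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bidirectional_score(left, candidate, right, unigrams, bigrams):
    """Combine evidence from the left and the right neighbour of a candidate.

    Uses add-one smoothing (an assumption, not taken from the paper).
    """
    vocab_size = len(unigrams)
    forward = (bigrams[(left, candidate)] + 1) / (unigrams[left] + vocab_size)
    backward = (bigrams[(candidate, right)] + 1) / (unigrams[right] + vocab_size)
    return forward * backward

def best_candidate(left, candidates, right, unigrams, bigrams):
    """Return the candidate with the highest bidirectional score."""
    return max(
        candidates,
        key=lambda c: bidirectional_score(left, c, right, unigrams, bigrams),
    )

# Toy topic corpus (finance); purely illustrative.
topic_corpus = [
    "the bank approved the loan",
    "the bank raised the interest rate",
    "the loan interest is high",
]
unigrams, bigrams = train_bigrams(topic_corpus)

# Suppose OCR produced an unknown word between "loan" and "rate".
choice = best_candidate("loan", ["internet", "interest"], "rate", unigrams, bigrams)
```

In this toy setting `choice` is `"interest"`, because both the left neighbour ("loan") and the right neighbour ("rate") co-occur with it in the topic corpus, while "internet" never appears there.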
