The OCR output of scanned document images suffers from recognition errors, especially for languages with distinctive script characteristics and rich morphology such as Arabic, so an effective error correction model is needed. This paper focuses on three aspects of post-processing correction. First, we improve the alignment and error n-gram models by adding correction rules based on character meta-classes rather than on specific characters, which is better suited to Arabic. Second, we use language models to interpret and correct Arabic word fragments that result from agglutinated affixes or isolated letters. Third, we improve the language models by adding semantic information to the correction process: bidirectional n-grams, stemming, and stop-word removal give higher weights to n-grams that share semantic meaning. In addition, we use a topic-specific corpus rather than a global one to obtain a better probability distribution. The proposed model is effective in correcting lexical errors and also covers semantic errors, which are not frequently reported by OCR systems and are otherwise corrected only through manual proofreading. The proposed method increases the correction rate by almost 13%, especially for meaningful terms.
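The idea of scoring a candidate correction with bidirectional n-grams can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulation, so everything below is an assumption made for illustration: the function names (`train_bigrams`, `bidirectional_score`, `pick_correction`), the use of bigrams, the add-one smoothing, and the toy English corpus standing in for a topic corpus. Stemming, stop-word removal, and the character meta-class rules are omitted.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams in a small (hypothetical) topic corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bidirectional_score(candidate, left, right, unigrams, bigrams):
    """Combine left-context and right-context bigram evidence.

    Add-one smoothing is an assumption of this sketch, not the paper's choice.
    """
    vocab_size = len(unigrams)
    # Evidence that `candidate` follows the word to its left.
    forward = (bigrams[(left, candidate)] + 1) / (unigrams[left] + vocab_size)
    # Evidence that `candidate` precedes the word to its right.
    backward = (bigrams[(candidate, right)] + 1) / (unigrams[right] + vocab_size)
    return forward * backward


def pick_correction(candidates, left, right, unigrams, bigrams):
    """Return the candidate with the highest bidirectional score."""
    return max(
        candidates,
        key=lambda c: bidirectional_score(c, left, right, unigrams, bigrams),
    )


# Toy corpus; a real system would use a topic-specific Arabic corpus.
corpus = [
    "the court issued a ruling",
    "the court issued a verdict",
    "the count of votes was high",
]
unigrams, bigrams = train_bigrams(corpus)
best = pick_correction(["court", "count"], "the", "issued", unigrams, bigrams)
```

In this toy setting both candidates are plausible after "the", and it is the right-hand context "issued" that separates them, which is the motivation for looking in both directions rather than only at the preceding words.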