2021
DOI: 10.11591/ijeecs.v21.i1.pp233-241

Corpus-based technique for improving Arabic OCR system

Abstract: Optical character recognition (OCR) refers to the process of converting text document images into editable and searchable text. The OCR process poses several challenges, particularly for the Arabic language, where it produces a high percentage of errors. In this paper, a method to improve the outputs of Arabic optical character recognition (AOCR) systems is suggested, based on a statistical language model built from the available huge corpora. This method includes detecting and corre…

Cited by 5 publications (2 citation statements)
References 22 publications
“…If it matches, then the model moves to the next word; otherwise, the error model again suggests the new token provided by Google's spelling suggestion algorithm, as shown in Figure 8. The authors of [102] present a corpus-based technique for improving the performance of an Arabic OCR system. The method involves using a large corpus of texts in Arabic to train the OCR system and improve its recognition accuracy.…”
Section: Postprocessing (mentioning)
confidence: 99%
“…Khosrobeigi et al [42] applied a context-based post-processing technique to improve Persian OCR performance and increased accuracy by 93%. Aliwy and Al-Sadawi [43] aimed to reduce noise in character recognition by using the combined text of corpus files. In addition, they constructed a dictionary and N-gram language model for detecting and correcting errors at the post-processing step, and results improved up to 98%.…”
Section: Figure 2 Example Of a Thai Word With Three Elements: Letters... (mentioning)
confidence: 99%
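
The citation statements above describe post-processing that uses a dictionary and an N-gram language model to detect and correct OCR errors. The following is a minimal sketch of that general idea, not the cited authors' implementation. It assumes a bigram model with raw counts, a dictionary taken from the corpus vocabulary, and Levenshtein distance for candidate generation. The function names (`train_bigrams`, `edit_distance`, `correct`), the `max_dist` threshold, and the toy English tokens are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_bigrams(corpus_sentences):
    """Count unigrams and bigrams from already-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent  # "<s>" marks the sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct(tokens, unigrams, bigrams, max_dist=2):
    """Detect out-of-dictionary tokens and replace them with the
    candidate best supported by the preceding word (assumed heuristic)."""
    vocab = [w for w in unigrams if w != "<s>"]
    out, prev = [], "<s>"
    for tok in tokens:
        if tok != "<s>" and tok in unigrams:
            best = tok  # detection step: token is in the dictionary
        else:
            candidates = [w for w in vocab
                          if edit_distance(tok, w) <= max_dist]
            # Rank by bigram count, then unigram count, then closeness;
            # keep the token unchanged if no candidate is close enough.
            best = max(
                candidates,
                key=lambda w: (bigrams[(prev, w)], unigrams[w],
                               -edit_distance(tok, w)),
                default=tok,
            )
        out.append(best)
        prev = best
    return out


# Toy corpus standing in for a large Arabic corpus (illustrative only).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "cow", "sat"]]
unigrams, bigrams = train_bigrams(corpus)
corrected = correct(["the", "cot", "sat"], unigrams, bigrams)
```

In this toy run, "cot" is not in the dictionary, and "cat" wins over "cow" and "sat" because it is the only candidate that follows "the" in the corpus. A realistic system would likely add smoothing and a character-confusion model tuned to the OCR engine, neither of which is shown here.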