Corpus-based technique for improving Arabic OCR system

Aliwy, Ahmed H.; Al-Sadawi, Basheer

doi:10.11591/ijeecs.v21.i1.pp233-241

Cited by 5 publications

(2 citation statements)

References 22 publications


“…If it matches, then the model moves to the next word; otherwise, the error model again suggests the new token provided by Google's spelling suggestion algorithm, as shown in Figure 8. The authors of [102] present a corpus-based technique for improving the performance of an Arabic OCR system. The method involves using a large corpus of texts in Arabic to train the OCR system and improve its recognition accuracy.…”

Section: Postprocessing (mentioning)

confidence: 99%

A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges

et al. 2023


Optical character recognition (OCR) is the process of extracting handwritten or printed text from a scanned or printed image and converting it to a machine-readable form for further data processing, such as searching or editing. Automatic text extraction using OCR helps to digitize documents for improved productivity and accessibility and for preservation of historical documents. This paper provides a survey of the current state-of-the-art applications, techniques, and challenges in Arabic OCR. We present the existing methods for each step of the complete OCR process to identify the best-performing approach for improved results. This paper follows the keyword-search method for reviewing the articles related to Arabic OCR, including the backward and forward citations of the article. In addition to state-of-art techniques, this paper identifies research gaps and presents future directions for Arabic OCR.


“…Khosrobeigi et al [42] applied a context-based post-processing technique to improve Persian OCR performance and increased accuracy by 93%. Aliwy and Al-Sadawi [43] aimed to reduce noise in character recognition by using the combined text of corpus files. In addition, they constructed a dictionary and N-gram language model for detecting and correcting errors at the post-processing step, and results improved up to 98%.…”

Section: Figure 2 Example Of a Thai Word With Three Elements: Letters... (mentioning)

confidence: 99%

Automated Data Digitization System for Vehicle Registration Certificates Using Google Cloud Vision API

Thammarak, Sirisathitkul, Kongkla et al. 2022

Civ Eng J


This study aims to develop an automated data digitization system for the Thai vehicle registration certificate. It is the first system developed as a web service Application Programming Interface (API), which is essential for any enterprise to increase its business value. Currently, this system is available on “www.carjaidee.com”. The system involves four steps: 1) an embedded frame aligns a document to be correctly recognised in the image acquisition step; 2) sharpening and brightness filtering techniques to enhance image quality are applied in the pre-processing step; 3) the Google Cloud Vision API receives a prompt to proceed in the recognition step; 4) a specific domain dictionary to improve accuracy rate is developed for the post-processing step. This study defines 92 images for the experiment by counting the correct words and terms from the output. The findings suggest that the proposed method, which had an average accuracy of 93.28%, was significantly more accurate than the original method using only the Google Cloud Vision API. However, the system is limited because the dictionaries cannot automatically recognise a new word. In the future, we will explore solutions to this problem using natural language processing techniques. Doi: 10.28991/CEJ-2022-08-07-09
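
The citation statements and abstract above describe post-processing that checks OCR output against a dictionary and, in the cited paper, an N-gram language model. The following is a minimal sketch of that general idea, not the method of any of the papers listed here. It assumes a tiny English toy corpus in place of an Arabic one, uses Python's standard `difflib.get_close_matches` as a stand-in for whatever candidate generation the authors used, and the names `build_models` and `correct` are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def build_models(corpus_sentences):
    """Count unigrams (the dictionary) and bigrams from whitespace-tokenized text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def correct(ocr_tokens, unigrams, bigrams, cutoff=0.6):
    """Replace out-of-dictionary tokens with the best-scoring close match."""
    corrected = []
    for token in ocr_tokens:
        if token in unigrams:
            corrected.append(token)  # token is in the dictionary; accept it
            continue
        # Assumption: string similarity is a reasonable proxy for OCR confusions.
        candidates = get_close_matches(token, list(unigrams), n=5, cutoff=cutoff)
        if not candidates:
            corrected.append(token)  # no plausible fix; keep the OCR output
            continue
        prev = corrected[-1] if corrected else None
        # Prefer candidates seen after the previous word, then frequent words.
        best = max(candidates, key=lambda c: (bigrams[(prev, c)], unigrams[c]))
        corrected.append(best)
    return corrected


# Toy corpus (an assumption for illustration only).
corpus = ["the car is red", "the cat is black", "the car is fast"]
unigrams, bigrams = build_models(corpus)
result = correct("the cor is red".split(), unigrams, bigrams)
print(result)  # ['the', 'car', 'is', 'red']
```

A real system would need a much larger corpus, Arabic-aware tokenization, and smoothing for unseen bigrams; this sketch only shows how a dictionary lookup and bigram counts can be combined to pick a correction.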
