2020
DOI: 10.48550/arxiv.2009.09115
Preprint

An Efficient Language-Independent Multi-Font OCR for Arabic Script

Abstract: Optical Character Recognition (OCR) is the process of extracting digitized text from images of scanned documents. While OCR systems have already matured in many languages, they still have shortcomings in cursive languages with overlapping letters such as the Arabic language. This paper proposes a complete Arabic OCR system that takes a scanned image of Arabic Naskh script as an input and generates a corresponding digital document. Our Arabic OCR system consists of the following modules: Pre-processing, Word-le…

Cited by 2 publications (3 citation statements)
References: 21 publications

“…Publicly available scanned image datasets are tested by [40], e.g., the WATAN and APTI datasets with extensive vocabularies. The datasets are split into a training set and a testing set, where training data contain 282,000 word images and 1,200,000 characters images while testing 5500 words, and 100,500 characters are used.…”
Section: Dataset; citation type: mentioning
confidence: 99%
“…In some cases, the secondary ligature may not exist. Different Urdu characters join with each other and form different ligatures based on joiner rules [29]. These ligatures often take the form of diacritics like dots or dashes that are appended somewhere with the character.…”
Section: F Post-processing; citation type: mentioning
confidence: 99%
“…These secondary ligatures are stored separately in a list and are marked during the entire training process. When the training procedure invokes the distance scaling module, we allot specific weights to different ligatures based on the joiner rules [29]. These weights allow the model to separate similar characters with only a discriminating diacritic, which helps in improving the per-class classification accuracy and the subsequent overall accuracy.…”
Section: F Post-processing; citation type: mentioning
confidence: 99%