An Efficient Language-Independent Multi-Font OCR for Arabic Script

Osman, Hussein Al; Zaghw, Karim; Hazem, Mostafa; Elsehely, Seifeldin

doi:10.48550/arxiv.2009.09115

Cited by 2 publications

(3 citation statements)

References 21 publications

“…Publicly available scanned image datasets are tested by [40], e.g., the WATAN and APTI datasets with extensive vocabularies. The datasets are split into a training set and a testing set, where training data contain 282,000 word images and 1,200,000 characters images while testing 5500 words, and 100,500 characters are used.…”

Section: Dataset (mentioning)

confidence: 99%

A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges

et al. 2023

Optical character recognition (OCR) is the process of extracting handwritten or printed text from a scanned or printed image and converting it to a machine-readable form for further data processing, such as searching or editing. Automatic text extraction using OCR helps to digitize documents for improved productivity and accessibility and for preservation of historical documents. This paper provides a survey of the current state-of-the-art applications, techniques, and challenges in Arabic OCR. We present the existing methods for each step of the complete OCR process to identify the best-performing approach for improved results. This paper follows the keyword-search method for reviewing the articles related to Arabic OCR, including the backward and forward citations of the article. In addition to state-of-art techniques, this paper identifies research gaps and presents future directions for Arabic OCR.

“…In some cases, the secondary ligature may not exist. Different Urdu characters join with each other and form different ligatures based on joiner rules [29]. These ligatures often take the form of diacritics like dots or dashes that are appended somewhere with the character.…”

Section: F Post-processing (mentioning)

confidence: 99%

“…These secondary ligatures are stored separately in a list and are marked during the entire training process. When the training procedure invokes the distance scaling module, we allot specific weights to different ligatures based on the joiner rules [29]. These weights allow the model to separate similar characters with only a discriminating diacritic, which helps in improving the per-class classification accuracy and the subsequent overall accuracy.…”

Section: F Post-processing (mentioning)

confidence: 99%

An Enhanced Prototypical Network Architecture for Few-Shot Handwritten Urdu Character Recognition

Sahay

Coustaty

2023

IEEE Access

Few shot models have started to gain a lot of popularity in the past few years. This is mostly because these models grant the ability to structure the representation space (classes) using a very less amount of examples for each class. Such models are usually trained on a wide range of different classes and their examples, which allows them to form and learn a decision-based metric in the process. Non-Latin languages, especially languages such as Urdu, have a bi-linear direction of writing and are context-sensitive in nature, and are hard to recognize. Also, unlike traditional English, there is a very small amount of clean, collated, and usable data that is available for the Urdu language. In this paper, we explore a prototypical network for k-shot classification on handwritten Urdu characters. The prototypical network learns the Euclidean embeddings of the provided images and uses clusters to classify newer examples. Our improved method is able to outperform other methods of few-shot learning and is able to accurately classify both Urdu characters as well as numerals using a minimal number of examples. After comprehensive qualitative and quantitative evaluation and comparison of our proposed approach with other methods to classify handwritten text in few-shot settings, we found out that our proposed approach was typically able to beat other methods by a margin of 1% − 2% while relying on a small training set.
