Segmentation-free optical character recognition for printed Urdu text

Din, Israr Ud; Siddiqi, Imran; Khalid, Shehzad; Azam, Tahir

doi:10.1186/s13640-017-0208-z

Cited by 35 publications

(10 citation statements)

References 40 publications


“…Their proposed system is trained on the UPTI dataset using Multidimensional-LSTM Recurrent Network that has attained 98% of accuracy on Nastaleeq Urdu Font. To recognize Urdu text, Israr Ud Din et al [2] presented a holistic approach for the recognition of printed Urdu text in Nastaleeq font. They have extracted 9 different statistical features with cumulative dimensionality of 116 for each sub-word image using a sliding window from left to right.…”

Section: Literature Review (mentioning)

confidence: 99%

Optical Character Recognition of Urdu Text using Histogram of Oriented Gradient Features

Awais, Iqbal, Rasool, et al. 2022

Preprint


Optical character recognition has received significant research attention as a means of digitizing text in images. Urdu OCR is more difficult than OCR for English and similar languages because of the script's complexity: a character can take multiple forms depending on its position in the word. The proposed work presents a segmentation-free (i.e., holistic) approach for offline Urdu printed text detection. Text lines are extracted from an image using horizontal histogram projection, and ligatures within each extracted text line are segmented using connected component labelling. A set of 14 statistical features, along with HOG features, is extracted for each sub-word/ligature and used to train the proposed model. The open-source UPTI dataset [10] is used to train and test the proposed algorithm. An SVM with an RBF kernel is used to classify the ligatures. The proposed algorithm achieves a 97.3% character recognition rate on this dataset.


“…There are a limited number of benchmark datasets for Perso-Arabic scripts. Some of them have been presented here: UPTI: Urdu Printed Text Image dataset, used by [19], [20], [24], [26] [30]. Although the first dataset of its kind till that time, this handwritten offline dataset has a limited number of data samples, 44 individual characters, and 57 Urdu words, focusing on one field mostly, the financial terms.…”

Section: Perso-arabic Datasets (mentioning)

confidence: 99%

An Insight for Cursive Context-Specific Printed Script Recognition

Rafique, Javid 2021

Preprint


The greatest challenge in machine learning problems is selecting suitable techniques and resources such as tools and datasets. Despite millions of speakers around the globe and a rich literary history of more than a thousand years, computational linguistic work on the Punjabi Shahmukhi script, a member of the low-resource family of Perso-Arabic context-specific scripts, is hard to find. This paper presents a deep insight into the related work with summary statistics, advocating the popularity and success of artificial neural networks and related techniques. It draws on recent trends from authentic sources based on feedback from top-level researchers, including on machine learning frameworks. A comprehensive comparison of the most popular deep learning techniques, the convolutional neural network and the recursive neural network, is presented for cursive context-specific scripts of Perso-Arabic nature. An overview of the available benchmark datasets for machine learning problems, especially for the Perso-Arabic group, is included. The paper provides essential knowledge for researchers in machine learning and natural language processing on the selection of algorithms, architectures, and resources.


“…Due to challenges already discussed, implicit segmentation‐based techniques have remained a popular choice of researchers [72–75]. Likewise, in the case of holistic approaches, ligatures have been typically employed as recognition units [76].…”

Section: Related Work (mentioning)

confidence: 99%

“…These techniques use the sliding windows to extract features from ligature images which are projected in the quantised feature space hence representing each ligature image as a sequence. In some cases, the main body and dots are separately recognised [76] to reduce the total number of unique classes which can be very high in case of Urdu text (Urdu has more than 26,000 unique ligatures [84]). A number of holistic techniques are based on word spotting [85, 86] rather than recognition, to retrieve documents containing words similar to those provided as a query.…”

Section: Related Work (mentioning)

confidence: 99%
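
The sliding-window idea described in the citation statement above, where a ligature image becomes a sequence of quantised feature vectors, can be sketched as follows. This is an illustrative assumption rather than the method of any cited paper: the two per-window features (ink density and normalised vertical centre of ink), the window width and step, the hand-written codebook, and the toy `LIGATURE` image are all hypothetical. A real system would use richer features and would learn the codebook, for example by clustering.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # binary image, 1 = ink, 0 = background
Feature = Tuple[float, float]    # (ink density, normalised vertical centre)


def window_features(image: Image, left: int, width: int) -> Feature:
    """Compute two simple features for the window starting at column `left`."""
    height = len(image)
    ink_rows = [y for y in range(height)
                for x in range(left, left + width) if image[y][x] == 1]
    density = len(ink_rows) / (height * width)
    if ink_rows and height > 1:
        centre = (sum(ink_rows) / len(ink_rows)) / (height - 1)
    else:
        centre = 0.0                 # empty window or single-row image
    return (density, centre)


def sliding_window_sequence(image: Image, width: int = 2, step: int = 1) -> List[Feature]:
    """Slide a window from left to right and collect one feature vector per position."""
    total_width = len(image[0])
    return [window_features(image, left, width)
            for left in range(0, total_width - width + 1, step)]


def quantise(sequence: Sequence[Feature], codebook: Sequence[Feature]) -> List[int]:
    """Replace each feature vector by the index of its nearest codeword."""
    def nearest(vec: Feature) -> int:
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))
    return [nearest(vec) for vec in sequence]


# Hypothetical 4x4 ligature image: ink low on the left, high on the right.
LIGATURE: Image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

# Hypothetical two-entry codebook: "ink near the top" and "ink near the bottom".
CODEBOOK: List[Feature] = [(0.5, 0.0), (0.5, 1.0)]

features = sliding_window_sequence(LIGATURE, width=2, step=2)
symbols = quantise(features, CODEBOOK)
print(symbols)
```

The resulting symbol sequence is the kind of input a sequence classifier could consume. Recognising main bodies and dots separately, as the statement mentions, would mean running such a pipeline on each component class, which keeps the number of distinct classes lower.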