A survey on Urdu and Urdu like language stemmers and stemming techniques

Jabbar, Abdul; Iqbal, Sajid; Khan, Muhammad Usman Ghani; Hussain, Shafiq

doi:10.1007/s10462-016-9527-1

Cited by 17 publications

(7 citation statements)

References 35 publications


“…3. A document is tokenized using space or punctuation symbols [14], [37]. Non-language characters, special symbols, numeric values, and URLs are removed so that a document contains only words of the target language.…”

Section: Preprocessing Of Text Documents (mentioning)

confidence: 99%

Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network

et al. 2020


The rapid growth of electronic documents is causing problems such as unstructured data, which demand more time and effort to find a relevant document. Text Document Classification (TDC) has great significance in information processing and retrieval, where unstructured documents are organized into predefined classes. Urdu is the most favored research language among South Asian languages because of its complex morphology, unique features, and lack of linguistic resources like standard datasets. Compared to short-text tasks such as sentiment analysis, long-text classification needs more time and effort because of large vocabulary, more noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite the major limitations of ML models, such as learning directed features, they are the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multipurpose and multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.

Index Terms: Convolutional neural network, deep learning, machine learning, natural language processing, text document classification, Urdu text classification.
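The preprocessing step quoted in this entry (tokenize on spaces or punctuation, then drop URLs, numbers, symbols, and non-language characters) can be sketched roughly as follows. This is a minimal illustration, not the cited paper's implementation: the function names, the URL pattern, and the choice to treat only letters and combining marks in the Unicode Arabic block (U+0600 to U+06FF) as Urdu text are all assumptions made here.

```python
import re
import unicodedata

# Assumption: a URL is anything starting with http(s):// or www. up to whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def is_urdu_char(ch: str) -> bool:
    """Return True for letters and combining marks in the Arabic block.

    Assumption: Urdu text is approximated by U+0600..U+06FF. Digits and
    punctuation inside that block (for example the Urdu full stop) are
    rejected because their Unicode category is not a letter or a mark.
    """
    if not "\u0600" <= ch <= "\u06FF":
        return False
    return unicodedata.category(ch)[0] in ("L", "M")


def preprocess(text: str) -> list[str]:
    """Remove URLs, then split on every character that is not Urdu text."""
    text = URL_RE.sub(" ", text)
    # Every non-Urdu character (spaces, punctuation, digits, Latin letters)
    # becomes a separator, so tokenizing and cleaning happen in one pass.
    cleaned = "".join(ch if is_urdu_char(ch) else " " for ch in text)
    return cleaned.split()


tokens = preprocess("یہ کتاب ہے۔ https://example.com 123 abc")
```

One consequence of this design is that characters outside the block, such as the zero-width non-joiner sometimes used in Urdu typing, also act as separators; a real pipeline would need to decide how to handle them.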


“…The agglutination nature of the Urdu language means that the prefix, lemma, and suffix are added to the root (stem) word with multiple different combinations making a more complicated word structure (morphology) [42]. A token may either change a word’s NE type or the word may not be classified as NEs when agglutinated with other words.…”

Section: Challenges Of Urdu Named Entity Recognition (mentioning)

confidence: 99%

A deep learning approach for Named Entity Recognition in Urdu language

Anam, Anwar, Jamal et al. 2024

PLoS ONE


Named Entity Recognition (NER) is a natural language processing task that has been widely explored for different languages in the recent decade but is still an under-researched area for the Urdu language due to its rich morphology and language complexities. Existing state-of-the-art studies on Urdu NER use various deep-learning approaches through automatic feature selection using word embeddings. This paper presents a deep learning approach for Urdu NER that harnesses FastText and Floret word embeddings to capture the contextual information of words by considering the surrounding context of words for improved feature extraction. The pre-trained FastText and Floret word embeddings are publicly available for Urdu language which are utilized to generate feature vectors of four benchmark Urdu language datasets. These features are then used as input to train various combinations of Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), CRF, and deep learning models. The results show that our proposed approach significantly outperforms existing state-of-the-art studies on Urdu NER, achieving an F-score of up to 0.98 when using BiLSTM+GRU with Floret embeddings. Error analysis shows a low classification error rate ranging from 1.24% to 3.63% across various datasets showing the robustness of the proposed approach. The performance comparison shows that the proposed approach significantly outperforms similar existing studies.


“…This makes Urdu a complex and highly rich morphological language. Further, it is one of the most important languages in South Asia, as it is spoken by more than 175 million people in Pakistan, India, and other South Asian countries [3][4][5].…”

Section: Introduction (mentioning)

confidence: 99%

Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach

2023


Lemmatization aims at returning the root form of a word. The lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks. These tasks include Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly-resourced languages. However, there have been no thorough efforts for the development of a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms. This makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language, based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between parts of speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach.
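The abstract above describes a dictionary lookup lemmatizer whose behavior is studied together with Part of Speech tags. The general idea might look like the sketch below. It is only an illustration under stated assumptions: the two toy dictionaries, the tag names, and the fallback order (PoS-specific entry, then word-only entry, then the word itself) are hypothetical and are not taken from the cited paper.

```python
from typing import Optional

# Hypothetical toy entries keyed by (surface form, PoS tag).
POS_LEMMAS = {
    ("کتابیں", "NOUN"): "کتاب",  # books -> book
    ("لڑکیاں", "NOUN"): "لڑکی",  # girls -> girl
    ("گیا", "VERB"): "جانا",     # went -> to go
}

# Hypothetical toy entries used when no PoS tag is available.
WORD_LEMMAS = {
    "کتابیں": "کتاب",
    "لڑکیاں": "لڑکی",
}


def lemmatize(word: str, pos: Optional[str] = None) -> str:
    """Look up a lemma, preferring the PoS-specific entry when a tag is given.

    Falls back to the word-only dictionary, and finally returns the word
    unchanged when it is not in either dictionary.
    """
    if pos is not None and (word, pos) in POS_LEMMAS:
        return POS_LEMMAS[(word, pos)]
    return WORD_LEMMAS.get(word, word)


lemmas = [lemmatize("کتابیں"), lemmatize("گیا", "VERB"), lemmatize("پاکستان")]
```

Returning unknown words unchanged keeps the function total, at the cost of silently passing through inflected forms that the dictionary does not cover.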
