Improving Arabic Text Classification Using P-Stemmer

Kanan, Tarek; Hawashin, Bilal; AlZu’bi, Shadi; Almaita, Eyad; Alkhatib, Ahmad Aa; Maria, Khulood Abu; Elbes, Mohammed

doi:10.2174/2666255813999200904114023

Cited by 15 publications

(7 citation statements)

References 15 publications

“…Stopwords account for around 20%-30% of a document's exhaustive words. These terms can be deleted since they are repetitive [28]. The basic approach for extracting stopwords is static, meaning it uses a pre-filled list of all words that are semantically irrelevant to a specific language.…”

Section: Preprocessing (mentioning)

confidence: 99%

Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning

Masadeh et al., 2024, IJACSA

Arabic Text Classification (ATC) is a crucial step for various Natural Language Processing (NLP) applications. It emerged as a response to the exponential growth of online content like social posts and review comments. In this study, preprocessing techniques and representation models are used to evaluate the effectiveness of ATC using Machine Learning (ML). Generally, the ATC operation depends on various factors, such as stemming in preprocessing, feature extraction and selection, and the nature of the dataset. To enhance the overall classification performance, preprocessing methodologies are primarily employed to transform each Arabic term into its root form and reduce the dimensionality of representation. In the representation of Arabic text, feature extraction and selection processes are imperative, as they significantly enhance the performance of ATC. This study implements the chosen classifiers using various feature selection algorithms. The comprehensive assessment of classification outcomes is conducted by comparing various classifiers, including Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and linear Support Vector Classifier (LSVC). These ML classifiers are assessed utilizing short and long Arabic text benchmark datasets called BBC Arabic corpus and the COVID-19 dataset. The assessment findings indicate that the efficacy of classification is significantly influenced by the preprocessing methods, representation model, classification algorithm, and the datasets' characteristics. In most cases, the SGDC and LSVC have consistently surpassed other classifiers for the datasets under consideration when significant features are chosen.
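
The static stopword approach described in this entry's citation statement can be illustrated with a minimal Python sketch. The stop list below is a tiny hypothetical sample chosen for illustration, not the list used by the cited study or by P-Stemmer, and the whitespace tokenization is likewise an assumption; a real pipeline would use a much larger list and proper Arabic normalization.

```python
# Minimal sketch of static stopword removal: tokens are filtered against
# a pre-filled list. ARABIC_STOPWORDS is a hypothetical sample list.
ARABIC_STOPWORDS = frozenset({"في", "من", "على", "إلى", "عن", "أن", "هذا"})


def remove_stopwords(tokens, stopwords=ARABIC_STOPWORDS):
    """Return the tokens that are not in the static stop list, in order."""
    return [t for t in tokens if t not in stopwords]


def stopword_ratio(tokens, stopwords=ARABIC_STOPWORDS):
    """Return the fraction of tokens that are stopwords (0.0 if empty)."""
    if not tokens:
        return 0.0
    return sum(t in stopwords for t in tokens) / len(tokens)


# Naive whitespace tokenization (an assumption made for this sketch).
tokens = "ذهب الولد إلى المدرسة في الصباح".split()
filtered = remove_stopwords(tokens)
ratio = stopword_ratio(tokens)
print(filtered)
print(f"stopword share: {ratio:.0%}")
```

Using a `frozenset` keeps each membership check constant-time, which matters once the stop list grows to several hundred entries.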

“…The author [13] proposes an approach to improve P-Stemmer by combining it with various classifiers such as Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star. In this study they used a data set synthesized from various online news pages and did the experience on Weka tools, which is achieving the result showed that the P stemmer has Improved when using NB.…”

Section: Related Work (mentioning)

confidence: 99%

Investigate the Impact of Stemming on Mauritanian Dialect Classification using Machine Learning Techniques

CHRIF, Seyed, Mahmoud, et al., 2023, IJACSA

Despite the plethora and diversity of research on Natural Language Processing (NLP). As a technique allowing computers to understand, generate, and manipulate human language; It still remains insufficient, especially with regard to the processing of Arabic texts and their dialects which are widely used. The proposed approach focuses on the application of machine learning techniques taking into account evaluation criteria such as training to comments expressed in Mauritanian dialect, published on social media notably Facebook, and compares results generated by three algorithms which we applied such as the Random Forest (RF), Naïve Bayes Multinominal (NBM), and Logistic Regression (LR) algorithm. Additionally, We then study the effect of machine learning techniques when different stemmers are combined with other features such as the tokenizers used to process the dataset. Although major challenges exist such as the morphology of Arabic is completely different from Latin letter languages, and there is no pre-existing dataset or dictionary to train the algorithms, the result we obtained after the experiments carried out on Weka shows that the RF and NBM algorithms are more efficient when applied with ArbicStemmerKhoja giving results respectively 96.37% and 71.40%; However, Logistic gets better performance results with Null Stemme is 81.65%. Results obtained by the three techniques applied with a light Arabic stemmer were more than 70%. This article presents a contribution to NLP based on Machine learning, descript also an important study that can determine the best Arabic classifier.

“…Heuristic techniques have been widely applied to perform data classification tasks [32, 33]. A heuristic for one dataset may not be equally effective for another dataset [34].…”

Section: Related Work (mentioning)

confidence: 99%

EHHR: an efficient evolutionary hyper-heuristic based recommender framework for short-text classifier selection

2022

With various machine learning heuristics, it becomes difficult to choose an appropriate heuristic to classify short-text emerging from various social media sources in the form of tweets and reviews. The No Free Lunch theorem asserts that no heuristic applies to all problems indiscriminately. Regardless of their success, the available classifier recommendation algorithms only deal with numeric data. To cater to these limitations, an umbrella classifier recommender must determine the best heuristic for short-text data. This paper presents an efficient reminisce-enabled classifier recommender framework to recommend a heuristic for new short-text data classification. The proposed framework, “Efficient Evolutionary Hyper-heuristic based Recommender Framework for Short-text Classifier Selection (EHHR),” reuses the previous solutions to predict the performance of various heuristics for an unseen problem. The Hybrid Adaptive Genetic Algorithm (HAGA) in EHHR facilitates dataset-level feature optimization and performance prediction. HAGA reveals that the influential features for recommending the best short-text heuristic are the average entropy, mean length of the word string, adjective variation, verb variation II, and average hard examples. The experimental results show that HAGA is 80% more accurate when compared to the standard Genetic Algorithm (GA). Additionally, EHHR clusters datasets and rank heuristics cluster-wise. EHHR clusters 9 out of 10 problems correctly.
