Enhancing Arabic stemming process using resources and benchmarking tools (2017)
DOI: 10.1016/j.jksuci.2016.11.010

Cited by 23 publications (14 citation statements); references 12 publications.
“…In spite of the increasing use of stemming as a requirement or a pre-processing step in different NLP applications, there is no stemming algorithm that is 100% precise. To address this problem, dissimilar studies have been recently focussed on evaluating and comparing the performance of Arabic stemmers to provide users and researchers with answers about the most appropriate algorithm for their tasks [7,9,10]. Nevertheless, there are no definite answers to the effectiveness of stemming in stylometric authorship applications in Arabic.…”
Section: Research Question (mentioning)
confidence: 99%
“…For our experiment we contacted the authors of all the aforementioned stemmers to share the source code. Only three agreed and shared the source [11], [17], [18], for which we are grateful. ARLSTem's [19] source is also freely available, but as it is in Python, we re-implemented it in Java.…”
Section: Related Work (mentioning)
confidence: 94%
“…On the other hand, LSTM or GRU models performed well either with bidirectional or with attention mechanism. What surprised us was the degraded performance by CNN-LSTM and CNN-GRU on the SPA corpus when used with the light stemmer [18]. For CNN-LSTM, the performance dropped from mid 90's (using any of the other stemming algorithms) to 75.4%, and for CNN-GRU it is even worse where it drops to ≈ 52%.…”
Section: Experimenting With Different Stemming Algorithms (mentioning)
confidence: 95%