Impact of Stemming and Word Embedding on Deep Learning-Based Arabic Text Categorization

Almuzaini, Huda Abdulrahman; Azmi, Aqil M.

doi:10.1109/access.2020.3009217

Cited by 73 publications

(47 citation statements)

References 45 publications


“…Preprocessing is a key task in semantic text similarity process. Stemming is an important technique adopted for preprocessing texts due to the fact that it reduces feature space and improves performance of the similarity process ( Alhaj et al, 2019 ; Almuzaini & Azmi, 2020 ).…”

Section: Related Work (mentioning)

confidence: 99%

“…Stemming effect has been studied and applied to different domains of NLP and computation linguistics. This includes document categorization ( Alhaj et al, 2019 ; Almuzaini & Azmi, 2020 ), information retrieval ( Zeroual & Lakhouaja, 2017 ; Alnaied, Elbendak & Bulbul, 2020 ), automatic essay scoring ( Al-Shalabi, 2016 ), and sentiment analysis ( Al-Saqqa, Awajan & Ghoul, 2019 ). In all these studies it has been reported that stemming and lemmatization improves the performance of the resulted models.…”

Section: Related Work (mentioning)

confidence: 99%


Effect of stemming on text similarity for Arabic language at sentence level

Alhawarat, Abdeljaber, Hilal (2021). PeerJ Computer Science


Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at sentence level. Ten algorithms are used in total, covering several Arabic light and heavy stemmers as well as lemmatization algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, namely Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves better Pearson correlation results. The best results were attained with the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers, the Khoja heavy stemmer and the AlKhalil light stemmer, give worse results than the original text.
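The abstract above reports its results as Pearson correlation between predicted and gold similarity scores. As a minimal sketch of that metric (not the authors' evaluation code), the function below computes it from two equal-length score lists; the function name and the sample scores are hypothetical and chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold and predicted sentence-similarity scores (illustrative only).
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
predicted = [0.4, 1.2, 2.7, 4.1, 4.9]
print(round(pearson(gold, predicted), 3))
```

A value near 1 means the predicted scores rank and scale sentence pairs much like the gold annotations do, which is why an improvement in this coefficient is read as an improvement in the similarity model.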


“…Testing was carried out by processing the title attribute through both the ordinary stemming process and the modified stemming process. The two processes yield a comparison of recall values on the titles [10], [11].…”

Section: P-ISSN: 2621-8070 E-ISSN: 2686-3219 (unclassified)

Analisa Modifikasi Algoritma Stemming Untuk Kasus Overstemming

Hersianie (2020). teknokom


Overstemming is the excessive truncation of a word toward its root form (root word). It causes words with very different meanings to end up with the same stem. To address this problem, previous research applied a stemming algorithm with a word-rule table. The drawback of the word-rule table, however, is that it is difficult to add new kinds of words that suffer from overstemming. This study therefore aims to modify that stemming algorithm. It combines stemming algorithms (hybrid stemming), namely a look-up table algorithm, the word-rule table, and the commonly used Porter stemming algorithm. The dataset used for testing is the title attribute of scientific publication documents. The test results show that the modified stemming algorithm achieves a recall of 89.9%. For future research, testing could be carried out on other attributes of the publication documents.
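The hybrid idea described in this abstract, consulting a look-up table before falling back to a rule-based stemmer, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the suffix list is a toy stand-in for a full Porter stemmer, the word-rule table is omitted, and the protected-word table, function names, and English example words are all hypothetical.

```python
# Hypothetical look-up table of words that the suffix rules would overstem.
PROTECTED = {"university": "university", "universe": "universe"}

# Toy suffix rules standing in for a full rule-based stemmer such as Porter.
SUFFIXES = ("ity", "ing", "es", "e", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def hybrid_stem(word: str) -> str:
    """Consult the look-up table first, then fall back to the suffix rules."""
    word = word.lower()
    if word in PROTECTED:
        return PROTECTED[word]
    return naive_stem(word)


def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant items that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Rules alone collapse two unrelated words onto one stem (overstemming).
print(naive_stem("university"), naive_stem("universe"))
# The look-up table keeps them apart.
print(hybrid_stem("university"), hybrid_stem("universe"))
```

The look-up table is checked first so that known problem words never reach the suffix rules; extending coverage then means adding table entries rather than rewriting rules.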


“…The stemming rather reduces the information gained from the data in many languages. In fact, the stemming improves accuracy (ACC [28]) achieved by various methods in different languages including not only English [29], [30] but also Arabic [26], [27], [31], [32], Indonesian [23], [33], [34], Japanese [25], [35] French [36]-[38], Portuguese [37], [39], German [37], [40], [41], Hungarian [37], [42], [43], Spanish [44]-[47], and Turkish [48]-[50].…”

Section: Introduction (mentioning)

confidence: 99%

Deep Sentiment Analysis: A Case Study on Stemmed Turkish Twitter Data

Shehu, Sharif, et al. (2021). IEEE Access


Sentiment analysis using stemmed Twitter data from various languages is an emerging research topic. In this paper, we address three data augmentation techniques namely Shift, Shuffle, and Hybrid to increase the size of the training data; and then we use three key types of deep learning (DL) models namely recurrent neural network (RNN), convolution neural network (CNN), and hierarchical attention network (HAN) to classify the stemmed Turkish Twitter data for sentiment analysis. The performance of these DL models has been compared with the existing traditional machine learning (TML) models. The performance of TML models has been affected negatively by the stemmed data, but the performance of DL models has been improved greatly with the utilization of the augmentation techniques. Based on the simulation, experimental, and statistical results analysis deeming identical datasets, it has been concluded that the TML models outperform the DL models with respect to both training-time (TTM) and runtime (RTM) complexities of the algorithms; but the DL models outperform the TML models with respect to the most important performance factors as well as the average performance rankings.
