2016
DOI: 10.3390/a9020027

The Effect of Preprocessing on Arabic Document Categorization

Abstract: Preprocessing is one of the main components in a conventional document categorization (DC) framework. This paper aims to highlight the effect of preprocessing tasks on the efficiency of the Arabic DC system. In this study, three classification techniques are used, namely, naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM). Experimental analysis on Arabic datasets reveals that preprocessing techniques have a significant impact on the classification accuracy, especially with co…

Cited by 60 publications (36 citation statements)
References 29 publications
“…Moreover, Hmeidi et al [12] studied the influence of raw text, khoja root-based stemmer and light stemming of Arabic text documents based on standard classifiers, such as NB, SVM, KNN, J48 and Decision Table classifiers. The results exhibited that the SVM and NB classifiers with light stemming provides better classification accuracy than other classifiers. The same conclusion was drawn up by Al-Badarneh [13] and Ayedh et al [14] by using various pre-processing methods. Additionally, Al-Molegi et al [15] and Khreisat [16] have proposed an approach to classify Arabic text documents based on the combination of N-grams with some similarity measures, including Manhattan, Euclidean distances and Dice.…”
Section: Related Work
Citation type: supporting
confidence: 78%
“…Different samples of such insignificant words are pronouns, articles, conjunctions ( ‫ه‬ ، ‫ه‬ ، ‫,)ه‬ prepositions ‫ل،(‬ ، ، ، ، ‫ا‬ ، ), demonstratives, ( ‫او‬ ‫ء،‬ ‫ا،ه‬ ‫)ه‬ and interrogatives ( ، ‫. )ا‬ Besides, Arabic-specific nouns stating place and time ‫ق،(‬ ، ) and symbols (@, #, &, %, *) are considered insignificant and can be removed (Ayedh et al, 2016). (2012) was used and updated by preventing the removal of certain stop-words in documents.…”
Section: Stop-word Removal
Citation type: mentioning
confidence: 99%
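The stop-word step quoted above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and a tiny, hypothetical stop-word set; the list actually used by Ayedh et al (2016) or by the citing paper is not given here. Only the symbol set (@, #, &, %, *) is taken from the quoted statement.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# a real system would load a curated Arabic stop-word list.
STOP_WORDS = {"في", "من", "على", "هذا", "هذه", "هل"}

# Symbols named in the quoted statement as insignificant.
SYMBOLS = re.compile(r"[@#&%*]")


def remove_stop_words(text):
    """Strip the listed symbols, split on whitespace, and drop stop-words."""
    cleaned = SYMBOLS.sub(" ", text)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]
```

Using a set keeps each membership check constant-time, which matters once the stop-word list grows to several hundred entries.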
“…2. Finally, the character that takes the symbol "ّ " can be replaced by two duplicate characters of the same character, as these characters are used to extract the Arabic roots in order to eliminate them for preventing them from affecting the meaning of the words (Ayedh et al, 2016).…”
Section: Normalization
Citation type: mentioning
confidence: 99%
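The shadda rule in the normalization statement above (a letter carrying "ّ" is written out as two copies of that letter) might be sketched in Python as follows. This is an assumption-laden illustration, not the cited authors' implementation: the function name `expand_shadda` is hypothetical, and the handling of other diacritics and of a stray shadda is a choice made here.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA, the gemination mark


def expand_shadda(text):
    """Replace each shadda with a second copy of the letter it marks."""
    out = []
    last_letter = None  # most recent base letter in the current word
    for ch in text:
        if ch == SHADDA:
            # A shadda with no preceding letter is simply dropped.
            if last_letter is not None:
                out.append(last_letter)
            continue
        out.append(ch)
        if ch.isalpha():
            last_letter = ch
        elif unicodedata.combining(ch) == 0:
            # Spaces, digits, punctuation: the shadda cannot attach across them.
            last_letter = None
    return "".join(out)
```

Other combining marks (for example a fatha placed between the letter and the shadda) are passed through unchanged, so the duplicated character is always the base letter rather than a diacritic.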
“…Removing all the stop words, symbols and stemming to the user queries before tokenization process take in place [14]. Tokenization will produces queries catchphrases in view of significance idea words.…”
Section: Tokenization
Citation type: mentioning
confidence: 99%