Feature selection using an improved Chi-square for Arabic text classification

Bahassine, Said; Madani, Abdellah; Al-Sarem, Mohammed; Kissi, Mohamed

doi:10.1016/j.jksuci.2018.05.010

Cited by 179 publications

(127 citation statements)

References 14 publications


“…In their study, Yelmen et al. established a two-step FS strategy that used IG and genetic search (GS) to obtain the optimum feature subset for sentiment classification [25]. Bahassine et al. developed an improved version of the CHI filter approach for Arabic text analysis and used it to classify documents into six classes [26].…”

Section: Feature Selection In Text Categorization (mentioning)

confidence: 99%

Development of majority vote ensemble feature selection algorithm augmented with rank allocation to enhance Turkish text categorization

Borandağ, Özçift, Kaygusuz

2021

Turk J Elec Eng & Comp Sci

The growing number of digital text documents from sources such as customer reviews, news, and social media has made text categorization crucial for managing the enormous amount of data. The high-dimensional nature of these texts requires a preliminary feature selection task to reduce the feature space, with a potential increase in prediction accuracy. In this study, an ensemble feature selection method, namely majority vote rank allocation, was developed for Turkish text categorization. The method uses a majority voting ensemble strategy in combination with a rank allocation approach to combine weak filters such as information gain, symmetric uncertainty, relief, and correlation-based feature selection. Thus, the proposed method measures the quality of each feature among all features using the majority votes of the filters and rank allocation. The feature selection efficacy of the method was tested on two datasets, one from the literature and one newly collected. The effect of the obtained features on classification performance was evaluated with the naive Bayes, support vector machine, J48, and random forest algorithms. It was empirically observed that the developed method improved the prediction accuracies of the classifiers compared to the mentioned filters. The statistical significance of the experimental results was also validated with a two-way Analysis of Variance test.
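The abstract above does not spell out how the votes and ranks are combined, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes every filter scores the same set of features with "higher is better" scores; the function name `majority_vote_rank`, the `top_k` cut-off, and the tie-breaking by summed rank are all hypothetical choices made for illustration.

```python
def majority_vote_rank(filter_scores, top_k):
    """Combine several filter rankings (illustrative sketch only).

    filter_scores: dict mapping a filter name to a dict {feature: score},
                   where a higher score means a more relevant feature.
    top_k:         a filter "votes" for a feature if it ranks it in its top_k.
    Returns the features voted for by more than half of the filters,
    ordered by their summed rank across all filters (best first).
    """
    votes = {}
    rank_sum = {}
    for scores in filter_scores.values():
        # Rank 1 is the best feature according to this filter.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, feature in enumerate(ranked, start=1):
            rank_sum[feature] = rank_sum.get(feature, 0) + rank
            if rank <= top_k:
                votes[feature] = votes.get(feature, 0) + 1

    majority = len(filter_scores) / 2
    selected = [f for f, v in votes.items() if v > majority]
    # Lower summed rank first; the feature name breaks ties deterministically.
    return sorted(selected, key=lambda f: (rank_sum[f], f))


if __name__ == "__main__":
    # Made-up scores for three hypothetical filters over three features.
    scores = {
        "ig": {"a": 0.9, "b": 0.5, "c": 0.1},
        "su": {"a": 0.8, "b": 0.2, "c": 0.6},
        "relief": {"a": 0.3, "b": 0.7, "c": 0.1},
    }
    print(majority_vote_rank(scores, top_k=2))
```

In this toy input, a feature survives only if at least two of the three filters place it in their top two, which is the sense of "majority vote" assumed here.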



“…p(t) and p(t̄) are the probabilities of presence and absence of term t, respectively. P(ci|t) and P(ci|t̄) are the conditional probabilities of class ci given the presence and absence of t, respectively [21], [7], [49], [4]. IG is used to reduce the entropy caused by partitioning the objects according to an attribute.…”

Section: Performance (mentioning)

confidence: 99%
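The quoted statement defines the terms of information gain (IG), but the formula itself is cut off. As a rough illustration, the sketch below uses the standard textbook form of IG for a term t over classes ci, which may differ in detail from the version in the citing paper. The function name `information_gain` and the representation of documents as token sets are assumptions made for this example.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """Standard information gain of `term` (illustrative sketch).

    IG(t) = -sum_i P(ci) log2 P(ci)
            + p(t)     * sum_i P(ci|t)     log2 P(ci|t)
            + p(not t) * sum_i P(ci|not t) log2 P(ci|not t)

    docs:   list of sets of tokens, one set per document.
    labels: list of class names, aligned with docs.
    """
    n = len(docs)
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def plogp_sum(class_labels):
        # Sum of P(c) * log2 P(c) over the classes that occur in class_labels.
        total = len(class_labels)
        if total == 0:
            return 0.0
        return sum((k / total) * math.log2(k / total)
                   for k in Counter(class_labels).values())

    p_t = len(present) / n      # p(t): fraction of documents containing the term
    p_not_t = len(absent) / n   # p(not t): fraction of documents without it
    return (-plogp_sum(labels)
            + p_t * plogp_sum(present)
            + p_not_t * plogp_sum(absent))


if __name__ == "__main__":
    # Tiny made-up corpus with two classes.
    docs = [{"goal", "match"}, {"goal", "team"},
            {"vote", "election"}, {"vote", "party"}]
    labels = ["sport", "sport", "politics", "politics"]
    for term in ("goal", "match"):
        print(term, information_gain(docs, labels, term))
```

A term that splits the classes perfectly recovers the full class entropy, while a term that appears in no document contributes nothing.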

“…There are several algorithms that can be used to classify documents. Examples of such algorithms include, but are not limited to, K-nearest neighbor (KNN), support vector machine (SVM), logistic regression (LR), random forest (RF), Naïve Bayes (NB), decision tree (DT), artificial neural network (ANN), and others [2], [4]–[15]. One of the main problems in classifying documents is the huge number of features that describe a dataset.…”

Section: Introduction and Related Work (mentioning)

confidence: 99%

Machine Learning and Feature Selection Approaches for Categorizing Arabic Text: Analysis, Comparison, and Proposal

Elnahas, El-Fishawy, Nour, et al.

2020

The Egyptian Journal of Language Engineering

This work adopts several classification approaches for categorizing Arabic text. The approaches are applied to two datasets used as test beds, and a comparative study evaluates the performance of the adopted classifiers. Several feature selection methods are also analyzed, investigated, and evaluated. Selecting the most significant features is important because a huge number of features may degrade text classification performance. A comparative study is carried out among the adopted feature selection methods for classifying Arabic documents. Moreover, the feature selection approaches are modified by combining the chosen methods. A novel method is also proposed for selecting the most appropriate features. The method is based on semantic fusion and multiple words (SF-MW) for constructing the features. The adopted feature selection methods are compared with the proposed one. The experimental results show that the SVM classifier performed best compared to the KNN and NB classifiers. Combining the adopted feature selection methods gives better results than using them individually. The proposed feature selection method (SF-MW) is promising, as it reduced the number of features and achieved higher classification accuracy. The accuracy improvement was about 22% for the two chosen Arabic test beds, which contain 1246 and 1500 documents respectively. The proposed method is also expected to be efficient for other Arabic and English datasets.


“…Step 2: Feature selection: We use the Chi-square (CHI) method (Bahassine, Madani, Al-Sarem, & Kissi, 2018; Thabtah, 2018) to assess how relevant each feature is to the classification result.…”

Section: Building the Prediction Model (unclassified)

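Applying Machine Learning Algorithms to Evaluate a Dataset for Autism Spectrum Disorder Classification (Ứng Dụng Các Thuật Toán Học Máy Để Đánh Giá Bộ Cơ Sở Dữ Liệu Trong Phân Loại Rối Loạn Phổ Tự Kỷ)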

Thuận, Thuận

2020

DLU JOS

In this paper, we present the results of evaluating a dataset for classifying autism spectrum disorder (ASD) in children from the UCI repository. We evaluate the dataset with the SVM and Random Forest algorithms, and additionally examine the Decision Trees, Logistic Regression, K-Nearest-Neighbors, Naïve Bayes, and Multi Layer Perceptron (MLP) neural network algorithms. The experiments with all seven algorithms give high classification results, consistent with previous studies. We conclude that the children's autism spectrum disorder classification dataset in the UCI repository is reliable.
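The citing statement above relies on Chi-square (CHI) to score how relevant each feature is to the class. As a hedged illustration, the sketch below computes the standard Chi-square statistic between a term and a class from a 2x2 contingency table. It is not the improved variant proposed by Bahassine et al., whose details do not appear on this page. The function name `chi_square` and the token-set document representation are assumptions made for this example.

```python
def chi_square(docs, labels, term, cls):
    """Standard Chi-square score of `term` for class `cls` (illustrative sketch).

    chi2(t, c) = N * (A*D - C*B)^2 / ((A + C) * (B + D) * (A + B) * (C + D))

    A: documents in cls that contain the term
    B: documents outside cls that contain the term
    C: documents in cls that do not contain the term
    D: documents outside cls that do not contain the term
    """
    a = b = c = d = 0
    for doc, lab in zip(docs, labels):
        has_term = term in doc
        in_class = lab == cls
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1

    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # The term (or the class) covers all or none of the documents.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


if __name__ == "__main__":
    # Tiny made-up corpus with two classes.
    docs = [{"goal", "match"}, {"goal", "team"},
            {"vote", "election"}, {"vote", "party"}]
    labels = ["sport", "sport", "politics", "politics"]
    for term in ("goal", "match"):
        print(term, chi_square(docs, labels, term, "sport"))
```

Features are then usually ranked by this score (for example, by the maximum over classes) and only the top-scoring ones are kept.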

