2021
DOI: 10.1007/s10462-021-09970-6

Feature selection methods for text classification: a systematic literature review

Cited by 73 publications (38 citation statements)
References 189 publications
“…The reason for using a CC is that it provides accurate results with reasonably fast execution. The same reasoning applies to the use of NB as the base classifier, which is also a popular choice that is frequently used in the literature (Pintas et al., 2021). Furthermore, our concept-drift adaptation strategy resets the classifier if drift is detected.…”
Section: Experimental Setup and Evaluation Metrics
Citation type: mentioning
confidence: 91%
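The quoted setup, a classifier chain (CC) with naive Bayes (NB) base classifiers that is reset on drift, can be sketched with standard scikit-learn pieces. This is our illustration, not the citing paper's code; the names make_chain, process_batch, and drift_detected are hypothetical, and the drift signal is assumed to come from an external detector.

```python
# Minimal sketch (assumption, not the citing paper's code): a classifier
# chain with multinomial naive Bayes base classifiers, rebuilt from
# scratch whenever an external drift detector raises a signal.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import MultinomialNB

def make_chain():
    # One NB model per label; later links also see earlier labels as inputs.
    return ClassifierChain(MultinomialNB(), random_state=0)

def process_batch(chain, vectorizer, texts, labels, drift_detected):
    """Fit on the current batch; rebuild the chain if drift was flagged."""
    if drift_detected:        # boolean from any drift detector (assumed)
        chain = make_chain()  # "resets the classifier if drift is detected"
    X = vectorizer.fit_transform(texts)   # refit the vocabulary on the batch
    chain.fit(X, labels)                  # labels: (n_samples, n_labels) 0/1
    return chain
```

Rebuilding the model rather than updating it incrementally mirrors the reset-on-drift strategy in the quote; an incremental variant could instead call partial_fit on per-label NB models.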
“…In this paper, information gain (IG) is used as the filter method and the genetic algorithm (GA) as the wrapper method for feature selection. In particular, these two methods have been applied to many research problems, including text classification [20], gene-expression microarray analysis [21], intrusion detection [22], financial distress prediction [23], and software defect prediction [24].…”
Section: The Feature Selection and Over-sampling Methods
Citation type: mentioning
confidence: 99%
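Both halves of the IG-plus-GA pipeline in the quote are straightforward to sketch. In the sketch below (our assumptions: scikit-learn and NumPy, a dense non-negative document-term matrix X and labels y; ig_filter and ga_wrapper are hypothetical names), information gain is approximated by mutual information between each feature and the class, and the GA evolves boolean feature masks scored by cross-validated accuracy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def ig_filter(X, y, k=500):
    """Filter step: keep the k features with the highest information gain
    (approximated here by mutual information between feature and class)."""
    return SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)

def ga_wrapper(X, y, pop_size=20, generations=10, seed=0):
    """Wrapper step: a bare-bones genetic algorithm over boolean feature
    masks, scored by cross-validated accuracy of the induced classifier."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5            # random initial masks

    def fitness(mask):
        if not mask.any():                           # empty mask is invalid
            return 0.0
        return cross_val_score(MultinomialNB(), X[:, mask], y, cv=3).mean()

    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)[-(pop_size // 2):]]  # fittest half
        children = []
        for _ in range(pop_size):                    # one-point crossover
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)
            children.append(np.concatenate([a[:cut], b[cut:]]))
        pop = np.array(children)
        pop ^= rng.random(pop.shape) < 0.01          # bit-flip mutation
    scores = np.array([fitness(m) for m in pop])
    return pop[scores.argmax()]                      # best mask found
```

A typical pipeline runs the cheap filter first to shrink the vocabulary and only then applies the wrapper, since each GA fitness evaluation trains and validates a classifier.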
“…Feature selection is an important tool for balancing the number of selected attributes: too few attributes underfit the model, while too many incur expensive computation. There are many feature selection methods, such as wrapper methods [11], filter methods, and unsupervised methods. Wrapper and filter methods are considered supervised approaches, as they use the output labels to produce the best set of features.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
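To make the supervised/unsupervised distinction concrete, the short sketch below (our illustration with hypothetical function names, assuming scikit-learn and a non-negative count matrix) contrasts a supervised filter, which scores features against the labels y, with an unsupervised selector that never sees y.

```python
# Illustration of supervised vs. unsupervised feature selection;
# function names are ours, not from the cited work.
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

def supervised_filter(X, y, k=100):
    # Filter method: scores each feature against the output labels y.
    return SelectKBest(chi2, k=k).fit_transform(X, y)

def unsupervised_filter(X, min_var=1e-3):
    # Unsupervised method: drops near-constant features without using y.
    return VarianceThreshold(threshold=min_var).fit_transform(X)
```

A wrapper method, by contrast, scores whole candidate subsets by training the downstream classifier itself, as in the GA sketch above.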