Proceedings of the Eleventh International Conference on Information and Knowledge Management 2002
DOI: 10.1145/584792.584911
High-performing feature selection for text classification

Abstract: This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method, and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new r…

Cited by 208 publications (43 citation statements)
References 3 publications (2 reference statements)
“…Several studies [31,32] found that feature selection methods based on χ 2 statistics consistently outperformed those based on other criteria (including information gain) for the most popular classifiers used in TC. The terms with a document frequency less than 5 were also removed, as χ 2 is known to be less reliable for rare words [31].…”
Section: Text Classification: Experimental Settings
confidence: 99%
“…The terms with a document frequency less than 5 were also removed, as χ 2 is known to be less reliable for rare words [31]. Both methods were applied and 10% of the terms were selected for the vector space representation.…”
Section: Text Classification: Experimental Settings
confidence: 99%
“…-Support-vector machines (SVM) [Cortes and Vapnik 1995] are a class of powerful methods for classification tasks, involving the construction of hyperplanes that have the largest distance to the nearest training points. Several papers reference support-vector machines as the state-of-the-art method for text classification [Gabrilovich and Markovitch 2004; Rogati and Yang 2002; Tong and Koller 2000]. We use a nonlinear poly-2 kernel [Joachims 1998] to train our classifiers, as preliminary experiments with a linear kernel did not yield statistically significant differences with a poly-2 kernel, which has also been a finding in some recent empirical evaluations of SVM kernels [Gao and Sun 2010].…”
Section: Statistical Machine Learning Techniques
confidence: 92%
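Training a text classifier with a degree-2 polynomial ("poly-2") SVM kernel, as this citation statement describes, can be sketched with scikit-learn. The corpus, labels, TF-IDF weighting, and kernel parameters below are illustrative assumptions, not the cited papers' actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy two-class corpus (0 = finance, 1 = sport); purely illustrative.
docs = [
    "stock market shares rise today",
    "market trading shares fall sharply",
    "team wins the match today",
    "match ends and the team loses",
]
labels = [0, 0, 1, 1]

# SVC with a degree-2 polynomial kernel; coef0=1 gives the
# inhomogeneous (1 + gamma * x.y)^2 form (a common choice, assumed here).
clf = make_pipeline(TfidfVectorizer(),
                    SVC(kernel="poly", degree=2, coef0=1))
clf.fit(docs, labels)

pred = clf.predict(["shares rally in the market"])
```

Swapping `kernel="poly"` for `kernel="linear"` is the comparison the passage refers to; on many text collections the two perform similarly.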
“…Feature selection can be defined as the process of removing irrelevant features and retaining only the relevant ones; an optimal selection of features can improve overall knowledge of the domain, reduce size, increase generalization capacity, etc. [9] J. Arturo Olvera-López stated that sufficient identification of features is necessary in real-world scenarios, hence the identification of features is important. [20] Yiming Yang stated that feature selection is the best solution for text classification problems, as it increases both classification effectiveness and computational efficiency. Instance selection is a process in which the dataset size is reduced, which eventually decreases the runtime, especially in the case of instance-based classifiers; the commonly used instance selection mechanisms are wrapper and filter, here…”
Section: Literature Survey
confidence: 99%