1998
DOI: 10.1007/bfb0026677

Feature subset selection in text-learning

Abstract: This paper describes several known and some new methods for feature subset selection on large text data. An experimental comparison on real-world data collected from Web users shows that the characteristics of the problem domain and of the machine learning algorithm should be considered when a feature scoring measure is selected. Our problem domain consists of hyperlinks given in the form of short documents represented as word vectors. In our learning experiments a naive Bayesian classifier was used on text data…
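A minimal sketch of the kind of setup the abstract describes (short hyperlink documents turned into word vectors and fed to a naive Bayesian classifier), using scikit-learn as a stand-in; the documents, labels, and all parameters below are illustrative assumptions, not the paper's own implementation or data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical short "hyperlink documents" and their class labels.
docs = ["machine learning conference call for papers",
        "cheap flights and hotel deals",
        "neural networks tutorial slides",
        "online casino bonus offers"]
labels = [1, 0, 1, 0]  # 1 = interesting to the user, 0 = not

# Represent each small document as a word-count vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Train a naive Bayesian classifier on the word vectors.
clf = MultinomialNB()
clf.fit(X, labels)

# Score a new, unseen hyperlink document.
print(clf.predict(vectorizer.transform(["deep learning workshop"])))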

Cited by 94 publications (52 citation statements)
References 8 publications (6 reference statements)
“…The combination of information gain and the Bayesian classifier outperforms all other methods only in the experiment on 2,000 training pages. Our results diverge from those reported by Mladenic [12], according to which information gain performs worse than random feature selection. A closer look at the words ranked highest by information gain shows that almost all of them are characteristic of the negative class value.…”
Section: Design of the Experiments and Results (contrasting)
confidence: 99%
“…For two-class problems, Mladenic [12] compared scoring measures based on the Odds ratio and those based on information gain, leading her to favor the former. For multi-class problems, as in the case of WebClass, an extension of the well-known TF-IDF measure to text categorization, originally proposed for information retrieval purposes [17], has been suggested [6].…”
Section: The Preprocessing Phase (mentioning)
confidence: 99%
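As a reference point for the two scoring measures contrasted in the statement above, these are the forms commonly used in this literature for a word w in a two-class problem (a sketch; the exact formulations in Mladenic [12] may differ, e.g. in how the probabilities are smoothed):

OR(w) = \log \frac{P(w \mid c_{+})\,\bigl(1 - P(w \mid c_{-})\bigr)}{\bigl(1 - P(w \mid c_{+})\bigr)\,P(w \mid c_{-})}

IG(w) = H(C) - P(w)\,H(C \mid w) - P(\bar{w})\,H(C \mid \bar{w})

where H denotes entropy, c_+ and c_- are the positive and negative class, and \bar{w} denotes the absence of word w.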
“…For each list, suppress words from x until the log-likelihood (LL) of y is less than the LL of the k closest classes. In addition to the PosMin metric as described in Sec 2, we used three standard feature selection metrics described in [4]: InfoGain, OddsRatio, and FreqLogP. Using a leave-one-out procedure we trained new naive Bayes and SVM classifiers on the corpus minus each redacted document, and tested on the redacted sets.…”
Section: Discussion (mentioning)
confidence: 99%
“…All performance values reported were obtained using a 2-fold cross-validation scheme. In addition to our two metrics, we carried out experiments with other metrics reported to be among the most effective: χ² (CHI), the multi-class version of Information Gain (IG), Document Frequency (DF) (Yang and Pedersen, 1997), the binary version of Information Gain (IG2) (Gabrilovich and Markovitch, 2004), Bi-normal separation (BNS) (Forman, 2003), and Odds Ratio (OR) (Mladenic, 1998). For metrics with one value per category (χ², IG2, BNS, OR), we used the maximum value as the score, for it performs better than the average value across metrics, classifiers, and text collections (Rogati and Yang, 2002).…”
Section: Performance Measures and Feature Selection Methods (mentioning)
confidence: 99%
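A small illustration of the max-aggregation just mentioned: when a metric yields one score per (term, category) pair, each term's final score is taken as the maximum over categories. The score matrix below is made-up toy data:

import numpy as np

# Hypothetical per-category scores: rows = terms, columns = categories.
scores = np.array([[0.10, 0.80, 0.05],
                   [0.40, 0.35, 0.30],
                   [0.02, 0.01, 0.90]])

term_scores = scores.max(axis=1)    # one score per term: max over categories
ranking = np.argsort(-term_scores)  # terms ranked best-first
print(term_scores, ranking)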
“…Third, the problem of feature selection in text categorization has been intensively studied, e.g., in (Yang and Pedersen, 1997; Mladenic, 1998; Soucy and Mineau, 2001; Forman, 2003; Gabrilovich and Markovitch, 2004). These works, as well as our methods, use the filtering approach, in which terms are scored by a metric and then the highest-ranked terms are selected (Sebastiani, 2002).…”
Section: Related Work (mentioning)
confidence: 99%
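The filtering approach described in that last statement (score every term with a metric, keep only the highest-ranked ones) can be sketched with scikit-learn's chi-squared scorer; the toy documents, labels, and the choice k=5 are arbitrary illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stocks rallied on strong earnings",
        "the team won the championship game",
        "quarterly profits beat expectations",
        "the striker scored twice in the final"]
labels = [0, 1, 0, 1]  # 0 = finance, 1 = sports (toy labels)

X = CountVectorizer().fit_transform(docs)

# Filtering approach: score each term with chi^2 against the labels,
# then keep the k best-scoring terms as the reduced feature set.
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)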