1998
DOI: 10.1007/bfb0026677

Feature subset selection in text-learning

Abstract: This paper describes several known and some new methods for feature subset selection on large text data. An experimental comparison on real-world data collected from Web users shows that the characteristics of the problem domain and of the machine learning algorithm should be considered when a feature scoring measure is selected. Our problem domain consists of hyperlinks given in the form of short documents represented as word vectors. In our learning experiments a naive Bayesian classifier was used on text data…
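A minimal sketch of the kind of setup the abstract describes (short hyperlink documents turned into word vectors and fed to a naive Bayesian classifier), using scikit-learn as a stand-in; the documents, labels, and all parameters below are illustrative assumptions, not the paper's own implementation or data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical short "hyperlink documents" and their class labels.
docs = ["machine learning conference call for papers",
        "cheap flights and hotel deals",
        "neural networks tutorial slides",
        "online casino bonus offers"]
labels = [1, 0, 1, 0]  # 1 = interesting to the user, 0 = not

# Represent each small document as a word-count vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Train a naive Bayesian classifier on the word vectors.
clf = MultinomialNB()
clf.fit(X, labels)

# Score a new, unseen hyperlink document.
print(clf.predict(vectorizer.transform(["deep learning workshop"])))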

Cited by 94 publications (52 citation statements)
References 8 publications (6 reference statements)
“…The combination of information gain and the Bayesian classifier outperforms all other methods only in the experiment on 2,000 training pages. Our results diverge from those reported by Mladenic [12], according to which information gain performs worse than random feature selection. A closer look at the words ranked highest by information gain shows that almost all of them are characteristic of the negative class value.…”
Section: Design of the Experiments and Results (contrasting)
confidence: 99%
“…For two-class problems, Mladenic [12] compared scoring measures based on the Odds ratio and those based on information gain, leading her to favor the former. For multi-class problems, as in the case of WebClass, an extension of the well-known TF-IDF measure to text categorization, originally proposed for information retrieval purposes [17], has been suggested [6].…”
Section: The Preprocessing Phase (mentioning)
confidence: 99%
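As a reference point for the two scoring measures contrasted in the statement above, these are the forms commonly used in this literature for a word w in a two-class problem (a sketch; the exact formulations in Mladenic [12] may differ, e.g. in how the probabilities are smoothed):

OR(w) = \log \frac{P(w \mid c_{+})\,\bigl(1 - P(w \mid c_{-})\bigr)}{\bigl(1 - P(w \mid c_{+})\bigr)\,P(w \mid c_{-})}

IG(w) = H(C) - P(w)\,H(C \mid w) - P(\bar{w})\,H(C \mid \bar{w})

where H denotes entropy, c_+ and c_- are the positive and negative class, and \bar{w} denotes the absence of word w.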
“…For each list, suppress words from x until the log-likelihood (LL) of y is less than the LL of the k closest classes. In addition to the PosMin metric as described in Sec 2, we used three standard feature selection metrics described in [4]: InfoGain, OddsRatio, and FreqLogP. Using a leave-one-out procedure we trained new naive Bayes and SVM classifiers on the corpus minus each redacted document, and tested on the redacted sets.…”
Section: Discussion (mentioning)
confidence: 99%
“…All performance values reported were obtained using a 2-fold cross-validation scheme. In addition to our two metrics, we carried out experiments with other metrics reported to be among the most effective: χ² (CHI), the multi-class version of Information Gain (IG), Document Frequency (DF) (Yang and Pedersen, 1997), the binary version of Information Gain (IG2) (Gabrilovich and Markovitch, 2004), Bi-normal separation (BNS) (Forman, 2003), and Odds Ratio (OR) (Mladenic, 1998). For metrics with one value per category (χ², IG2, BNS, OR), we used the maximum value as the score, for it performs better than the average value across metrics, classifiers, and text collections (Rogati and Yang, 2002).…”
Section: Performance Measures and Feature Selection Methods (mentioning)
confidence: 99%
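A small illustration of the max-aggregation just mentioned: when a metric yields one score per (term, category) pair, each term's final score is taken as the maximum over categories. The score matrix below is made-up toy data:

import numpy as np

# Hypothetical per-category scores: rows = terms, columns = categories.
scores = np.array([[0.10, 0.80, 0.05],
                   [0.40, 0.35, 0.30],
                   [0.02, 0.01, 0.90]])

term_scores = scores.max(axis=1)    # one score per term: max over categories
ranking = np.argsort(-term_scores)  # terms ranked best-first
print(term_scores, ranking)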
“…Third, the problem of feature selection in text categorization has been intensively studied, e.g., in (Yang and Pedersen, 1997; Mladenic, 1998; Soucy and Mineau, 2001; Forman, 2003; Gabrilovich and Markovitch, 2004). These works, as well as our methods, use the filtering approach, in which terms are scored by a metric and then the highest-ranked terms are selected (Sebastiani, 2002).…”
Section: Related Work (mentioning)
confidence: 99%
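The filtering approach described in that last statement (score every term with a metric, keep only the highest-ranked ones) can be sketched with scikit-learn's chi-squared scorer; the toy documents, labels, and the choice k=5 are arbitrary illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stocks rallied on strong earnings",
        "the team won the championship game",
        "quarterly profits beat expectations",
        "the striker scored twice in the final"]
labels = [0, 1, 0, 1]  # 0 = finance, 1 = sports (toy labels)

X = CountVectorizer().fit_transform(docs)

# Filtering approach: score each term with chi^2 against the labels,
# then keep the k best-scoring terms as the reduced feature set.
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)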