1997
DOI: 10.1007/3-540-62858-4_79

Learning when negative examples abound

Abstract: Existing concept learning systems can fail when negative examples heavily outnumber positive examples. The paper discusses one essential problem caused by imbalanced training sets and presents a learning algorithm that addresses it. The experiments (with synthetic and real-world data) focus on two-class problems whose examples are described by binary and continuous attributes.
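The failure mode the abstract describes can be illustrated with a minimal sketch (the data here are invented for the example): when negatives heavily outnumber positives, a degenerate classifier that always predicts the majority class still attains high overall accuracy while detecting no positive at all.

```python
# Synthetic illustration: 10 positives vs. 990 negatives.
labels = [1] * 10 + [0] * 990          # 10 positives, 990 negatives
predictions = [0] * len(labels)        # majority-class ("always negative") classifier

# Overall accuracy looks excellent...
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
# ...but not a single positive example is recognized.
true_positive_rate = sum(p == y == 1 for p, y in zip(predictions, labels)) / 10

print(accuracy)             # 0.99
print(true_positive_rate)   # 0.0
```

This is why the citing works below prefer class-sensitive measures (e.g. the geometric mean of the per-class accuracies) over plain classification accuracy on imbalanced data.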

Cited by 251 publications (178 citation statements); references 6 publications.
“…We would have had to model both the within-batch characteristics and the across-batch characteristics, and we simply did not have enough data or batches to do this with any certainty. To try to ensure that our learning algorithm is not specific to our particular dataset, we have tested it on other datasets having similar characteristics (Kubat, Holte & Matwin, 1997).…”
Section: Methodological Issues
confidence: 99%
“…Since we have an unbalanced class distribution (see Table 1), classification accuracy can give misleading results. For such domains a more appropriate measure is the Information score [7] or the Geometric mean of accuracy [8]. In the experimental results presented in Figure 1, Classification accuracy and Information score are used to estimate model quality.…”
Section: Methods
confidence: 99%
“…Commonly used metrics for two-class problems 24,45 include the arithmetic and geometric means of the sensitivity acc+ = TP rate and the specificity acc− = TN rate. In particular, the geometric mean of the two values is an interesting indicator of the quality of a classifier for imbalanced data, because it is high when both acc+ and acc− are high or when the difference between acc+ and acc− is small 29. Optimizing the geometric mean is a compromise intended to maximize the accuracy on both classes while keeping these accuracies balanced 30.…”
Section: Notation and Metrics for Two-Class Problems
confidence: 99%
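The geometric mean described in the excerpt above can be computed directly from the confusion-matrix counts. A minimal sketch (the function and variable names are ours, not from the cited works):

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (acc+) and specificity (acc-)."""
    acc_pos = tp / (tp + fn)   # acc+ : true-positive rate (sensitivity)
    acc_neg = tn / (tn + fp)   # acc- : true-negative rate (specificity)
    return math.sqrt(acc_pos * acc_neg)

# A classifier with balanced class-wise accuracies scores higher than one
# that trades minority-class recall for majority-class accuracy:
print(g_mean(tp=8, fn=2, tn=800, fp=190))   # acc+ = 0.8, acc- ≈ 0.808 → ≈ 0.804
print(g_mean(tp=1, fn=9, tn=985, fp=5))     # acc+ = 0.1, acc- ≈ 0.995 → ≈ 0.315
```

As the excerpt notes, the second classifier's near-perfect specificity cannot compensate for its poor sensitivity: the geometric mean collapses whenever either class-wise accuracy is low, which is exactly the property that makes it useful for imbalanced data.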