Proceedings of the 15th ACM International Conference on Information and Knowledge Management - CIKM '06 2006
DOI: 10.1145/1183614.1183709

Performance thresholding in practical text classification

Abstract: In practical classification, there is often a mix of learnable and unlearnable classes and only a classifier above a minimum performance threshold can be deployed. This problem is exacerbated if the training set is created by active learning. The bias of actively learned training sets makes it hard to determine whether a class has been learned. We give evidence that there is no general and efficient method for reducing the bias and correctly identifying classes that have been learned. However, we characterize …

Citations: cited by 32 publications (27 citation statements)
References: 25 publications (21 reference statements)

“…This step corresponds to uncertainty sampling [20], a classical active learning method applied in [1]. Uncertainty sampling suffers, however, from sampling bias [29]. We also perform rare category detection to foster the discovery of yet unknown families.…”
Section: Uncertainty Sampling (mentioning)
confidence: 99%
“…Active learning methods have been proposed to reduce the labelling cost by asking the expert to annotate only the most informative examples [32]. However, classical active learning methods often suffer from sampling bias [29,34]: a family (a group of similar malicious or benign examples) may be completely overlooked by the annotation queries as the expert is asked to annotate only the most informative examples. Sampling bias is a significant issue in intrusion detection: it may lead to missing a malicious family during the labelling process, and being unable to detect it thereafter.…”
Section: Introduction (mentioning)
confidence: 99%
“…During AL, as more and more labels are obtained, the training set quickly diverges from the underlying data distribution. (Schütze et al, 2006) states that AL can explore the feature space in such a biased way that it can end up ignoring entire clusters of unlabeled instances. We think that SWSD is highly prone for the mentioned missed cluster problem because of its unique nature.…”
Section: Effect Of Active Selection Strategy (mentioning)
confidence: 99%
“…And this is just in one dimension; in high dimension, the problem can be expected to be worse, since there are more places for this troublesome group to be hiding out. For a discussion of this problem in text classification, see the recent paper of Schutze et al (2006).…”
Section: Active Learning and Sampling Bias (mentioning)
confidence: 99%