Performance thresholding in practical text classification

Schütze, Hinrich; Velipasaoglu, Emre; Pedersen, Jan O.

doi:10.1145/1183614.1183709

Cited by 32 publications (27 citation statements)

References 25 publications (21 reference statements)


“…This step corresponds to uncertainty sampling [20], a classical active learning method applied in [1]. Uncertainty sampling suffers, however, from sampling bias [29]. We also perform rare category detection to foster the discovery of yet unknown families.…”

Section: Uncertainty Sampling (mentioning)

confidence: 99%

“…Active learning methods have been proposed to reduce the labelling cost by asking the expert to annotate only the most informative examples [32]. However, classical active learning methods often suffer from sampling bias [29,34]: a family (a group of similar malicious or benign examples) may be completely overlooked by the annotation queries as the expert is asked to annotate only the most informative examples. Sampling bias is a significant issue in intrusion detection: it may lead to missing a malicious family during the labelling process, and being unable to detect it thereafter.…”

Section: Introduction (mentioning)

confidence: 99%

“…Active learning methods rely on an interactive process where the expert is asked to annotate some instances from a large unlabelled pool to improve the current detection model and the relevance of the future annotation queries (see Figure 1). However, annotating only the most informative instances may cause a family of observations to be completely missed by the labelling process (see [8,29] for theoretical examples) and, therefore, may have a negative impact on the performance of the detection model.…”

Section: Introduction (mentioning)

confidence: 99%


ILAB: An Interactive Labelling Strategy for Intrusion Detection

Beaugnon, Chifflier, Bach. 2017. Research in Attacks, Intrusions, and Defenses


Abstract. Acquiring a representative labelled dataset is a hurdle that has to be overcome to learn a supervised detection model. Labelling a dataset is particularly expensive in computer security as expert knowledge is required to perform the annotations. In this paper, we introduce ILAB, a novel interactive labelling strategy that helps experts label large datasets for intrusion detection with a reduced workload. First, we compare ILAB with two state-of-the-art labelling strategies on public labelled datasets and demonstrate it is both an effective and a scalable solution. Second, we show ILAB is workable with a real-world annotation project carried out on a large unlabelled NetFlow dataset originating from a production environment. We provide an open source implementation (https://github.com/ANSSI-FR/SecuML/) to allow security experts to label their own datasets and researchers to compare labelling strategies.


“…During AL, as more and more labels are obtained, the training set quickly diverges from the underlying data distribution. (Schütze et al, 2006) states that AL can explore the feature space in such a biased way that it can end up ignoring entire clusters of unlabeled instances. We think that SWSD is highly prone for the mentioned missed cluster problem because of its unique nature.…”

Section: Effect Of Active Selection Strategy (mentioning)

confidence: 99%

Iterative Constrained Clustering for Subjectivity Word Sense Disambiguation

Akkaya, Wiebe, Mihalcea. 2014. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics


Subjectivity word sense disambiguation (SWSD) is a supervised and application-specific word sense disambiguation task disambiguating between subjective and objective senses of a word. Not surprisingly, SWSD suffers from the knowledge acquisition bottleneck. In this work, we use a "cluster and label" strategy to generate labeled data for SWSD semiautomatically. We define a new algorithm called Iterative Constrained Clustering (ICC) to improve the clustering purity and, as a result, the quality of the generated data. Our experiments show that the SWSD classifiers trained on the ICC generated data by requiring only 59% of the labels can achieve the same performance as the classifiers trained on the full dataset.


“…And this is just in one dimension; in high dimension, the problem can be expected to be worse, since there are more places for this troublesome group to be hiding out. For a discussion of this problem in text classification, see the recent paper of Schutze et al (2006).…”

Section: Active Learning and Sampling Bias (mentioning)

confidence: 99%