Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2003
DOI: 10.1145/956750.956758

Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near lin…
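
The idea in the abstract (a nested loop over randomly ordered data, with a pruning rule that abandons an example early) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `top_outliers`, the choice of the distance to the k-th nearest neighbor as the outlier score, and the one-example-at-a-time scan are all assumptions made for illustration.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return up to n (score, index) pairs, largest score first.

    Assumed score: distance from a point to its k-th nearest neighbor.
    """
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # random scan order makes early pruning likely

    top = []      # min-heap of (score, index): the best n outliers found so far
    cutoff = 0.0  # score of the weakest outlier in `top` once it holds n entries

    for i in range(len(points)):
        neighbors = []  # negated distances: a max-heap of the k closest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # -neighbors[0] can only shrink as more points are scanned, so it is
            # an upper bound on the final score. Below the cutoff, point i
            # cannot enter the top n and the inner loop can stop.
            if len(neighbors) == k and -neighbors[0] < cutoff:
                pruned = True
                break
        if pruned or len(neighbors) < k:
            continue
        score = -neighbors[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

In the worst case the inner loop still visits every point, so the sketch remains quadratic; any speed-up comes from the cutoff rising as stronger outliers are found, which lets most ordinary points be discarded after a few comparisons.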

citations
Cited by 401 publications
(259 citation statements)
references
References 16 publications
“…A non-parametric approach for discovering outliers uses distance metrics and defines a point to be a distance outlier if at least a user-defined fraction of the points in the data set are further away than some user-defined minimum distance from that point Ng 1998, 1999; Knorr et al 2000, Bay and Schwabacher 2003). A critical issue related to such distance-based methods is the arbitrariness of many user-supplied quantities which often require extensive human interactions and several iterations to determine an outlier.…”
Section: Outlier Detection Methods (mentioning)
confidence: 99%
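
The definition quoted in this statement, with its two user-supplied quantities, can be written out as a brute-force sketch. The names `db_outliers`, `p`, and `D` are illustrative, and computing the fraction over the other points (excluding the point itself) is an assumption; the quoted text does not settle that detail.

```python
import math


def db_outliers(points, p, D):
    """Indices of points for which at least a fraction p of the other
    points lie farther away than distance D."""
    flagged = []
    for i, x in enumerate(points):
        others = [y for j, y in enumerate(points) if j != i]
        far = sum(1 for y in others if math.dist(x, y) > D)
        if others and far / len(others) >= p:
            flagged.append(i)
    return flagged
```

Changing `p` or `D` can change which points are flagged, which is the sensitivity to user-supplied quantities that the statement criticizes.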
“…We selected a greyscale image resolution of 192×144, yielding a high representational dimensionality of 27,648 pixel features. Figure 2 shows the top 12 global outliers of this dataset, discovered using a distance-based outlier technique with k = 20 [5]. These images contain large brightly-lit areas and unusual shapes, features that make these images stand out distinctly from the rest of the data set as a whole.…”
Section: Example (mentioning)
confidence: 99%
“…Nevertheless, this approach may be applicable in place of other traditional indexing methods such as kD-trees. The pruning approach used by Bay and Schwabacher [5] cannot be readily applied to LOF due to the large amount of overlapping density computation required to find the LOF of a single point. Kriegel et.…”
Section: Outlier Detection and Scalability (mentioning)
confidence: 99%
“…Moreover, data's non-stationarity forces detector's models to be continuously updated, which is again very difficult with indexing techniques, as created indexes would need to be recalculated, which is usually expensive. Other methods, such as Bay and Schwabacher (2003) assumes some additional knowledge which might not be available. The presented anomaly detector has been designed with respect to these constraints, and it has been shown to achieve state of the art accuracy measured by area under ROC curve.…”
Section: Introduction (mentioning)
confidence: 99%