Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Bay, Stephen D.; Schwabacher, Mark

doi:10.1145/956750.956758

Cited by 401 publications

(259 citation statements)

References 16 publications

Mentioning: 243


“…A non-parametric approach for discovering outliers uses distance metrics and defines a point to be a distance outlier if at least a user-defined fraction of the points in the data set are further away than some user-defined minimum distance from that point Ng 1998, 1999; Knorr et al 2000, Bay and Schwabacher 2003). A critical issue related to such distance-based methods is the arbitrariness of many user-supplied quantities which often require extensive human interactions and several iterations to determine an outlier.…”

Section: Outlier Detection Methods (mentioning)

confidence: 99%

A framework of irregularity enlightenment for data pre-processing in data mining

Duan, Hesar, et al. 2008

Ann Oper Res


Irregularities are widespread in large databases and often lead to erroneous conclusions with respect to data mining and statistical analysis. For example, considerable bias is often resulted from many parameter estimation procedures without properly handling significant irregularities. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for dealing with the situation when multiple irregularities are hidden in large volumes of data in general and cross sectional time series in particular. It develops an automatic data mining platform to capture key irregularities and classify them based on their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least square loss function to aid the selection of key irregularities in consecutive steps and cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally visualization tools are developed to help analysts interpret and understand the nature of data better and faster before further data modeling and analysis.
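The distance-based outlier definition quoted in the citation statement above can be illustrated with a brute-force check. This is a minimal sketch of that definition only, not the randomized pruning algorithm of Bay and Schwabacher; the function name `find_distance_outliers`, the parameter names `min_fraction` and `min_distance`, and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


def find_distance_outliers(data, min_fraction, min_distance):
    """Return indices of points that satisfy the quoted definition.

    A point is flagged when at least `min_fraction` of the *other* points
    lie farther than `min_distance` from it (Euclidean distance assumed).
    This compares every pair of points, so it costs O(n^2) distance
    computations.
    """
    outliers = []
    n_others = len(data) - 1
    if n_others < 1:
        return outliers  # a single point has nothing to be compared against
    for i, p in enumerate(data):
        # Count the other points that are farther away than the threshold.
        far = sum(
            1
            for j, q in enumerate(data)
            if j != i and math.dist(p, q) > min_distance
        )
        if far / n_others >= min_fraction:
            outliers.append(i)
    return outliers


# Four points clustered near the origin and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(find_distance_outliers(points, min_fraction=0.95, min_distance=5.0))  # prints [4]
```

Both thresholds are user-supplied, which is the arbitrariness the citing authors point out: changing `min_fraction` or `min_distance` changes which points are flagged.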


“…We selected a greyscale image resolution of 192×144, yielding a high representational dimensionality of 27,648 pixel features. Figure 2 shows the top 12 global outliers of this dataset, discovered using a distance-based outlier technique with k = 20 [5]. These images contain large brightly-lit areas and unusual shapes, features that make these images stand out distinctly from the rest of the data set as a whole.…”

Section: Example (mentioning)

confidence: 99%

“…Nevertheless, this approach may be applicable in place of other traditional indexing methods such as kD-trees. The pruning approach used by Bay and Schwabacher [5] cannot be readily applied to LOF due to the large amount of overlapping density computation required to find the LOF of a single point. Kriegel et.…”

Section: Outlier Detection and Scalability (mentioning)

confidence: 99%

Density-preserving projections for large-scale local anomaly detection

2011


Outlier or anomaly detection is a fundamental data mining task with the aim to identify data points, events, transactions which deviate from the norm. The identification of outliers in data can provide insights about the underlying data generating process. In general, outliers can be of two kinds: global and local. Global outliers are distinct with respect to the whole data set, while local outliers are distinct with respect to data points in their local neighbourhood. While several approaches have been proposed to scale up the process of global outlier discovery in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of local outlier factor (LOF) for large and high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient subquadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into the use of high-dimensionality-specific indexing such as spatial approximate sample hierarchy (SASH) shows that our novel technique holds benefits over even these types of highly efficient indexing. We cement the practical applications of our novel technique with insights into what it


“…Moreover, data's non-stationarity forces detector's models to be continuously updated, which is again very difficult with indexing techniques, as created indexes would need to be recalculated, which is usually expensive. Other methods, such as Bay and Schwabacher (2003) assumes some additional knowledge which might not be available. The presented anomaly detector has been designed with respect to these constraints, and it has been shown to achieve state of the art accuracy measured by area under ROC curve.…”

Section: Introduction (mentioning)

confidence: 99%

Loda: Lightweight on-line detector of anomalies

2015


In supervised learning it has been shown that a collection of weak classifiers can result in a strong classifier with error rates similar to those of more sophisticated methods. In unsupervised learning, namely in anomaly detection such a paradigm has not yet been demonstrated despite the fact that many methods have been devised as counterparts to supervised binary classifiers. This work partially fills the gap by showing that an ensemble of very weak detectors can lead to a strong anomaly detector with a performance equal to or better than state of the art methods. The simplicity of the proposed ensemble system (to be called Loda) is particularly useful in domains where a large number of samples need to be processed in real-time or in domains where the data stream is subject to concept drift and the detector needs to be updated on-line. Besides being fast and accurate, Loda is also able to operate and update itself on data with missing variables. Loda is thus practical in domains with sensor outages. Moreover, Loda can identify features in which the scrutinized sample deviates from the majority. This capability is useful when the goal is to find out what has caused the anomaly. It should be noted that none of these favorable properties increase Loda's low time and space complexity. We compare Loda to several state of the art anomaly detectors in two settings: batch training and on-line training on data streams. The results on 36 datasets from UCI repository illustrate the strengths of the proposed system, but also provide more insight into the more general questions regarding batch-vs-on-line anomaly detection.

