Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

Otey, Matthew Eric; Ghoting, Amol; Parthasarathy, S.

doi:10.1007/s10618-005-0014-6

Cited by 195 publications

(128 citation statements)

References 19 publications


“…In most applications, there is a combination of both continuous and categorical values. There are approaches that can combine the similarity of continuous attributes with the similarity of categorical ones [Otey et al., 2006] [Tan et al., 2005]. Proximity-based methods can be classified into two groups: distance-based and density-based methods.…”

Section: Proximity-based Methods (mentioning)

confidence: 99%

Incremental anomaly detection using two-layer cluster-based structure

Bigdeli

Mohammadi

Raahemi

et al. 2018

Information Sciences


Anomaly detection algorithms face several challenges, including processing speed and dealing with noise in data. In this thesis, a two-layer cluster-based anomaly detection structure is presented which is fast, noise-resilient and incremental. In this structure, each normal pattern is considered as a cluster, and each cluster is represented using a Gaussian Mixture Model (GMM). Then, new instances are presented to the GMM to be labeled as normal or abnormal.

The proposed structure comprises three main steps. In the first step, the data are clustered. The second step is to represent each cluster in a way that enables the model to classify new instances. The Summarization based on Gaussian Mixture Model (SGMM) proposed in this thesis represents each cluster as a GMM. In the third step, a two-layer structure efficiently updates clusters using […] In most real-time anomaly detection applications, incoming instances are often similar to previous ones. In these cases, there is no need to update clusters based on duplicates, since they have already been modeled in the cluster distribution. The two-layer structure is responsible for identifying redundant instances. In this structure, redundant instances are ignored, and the remaining new instances are used to update clusters. Ignoring redundant instances, which are typically in the majority, makes the detection phase fast.

Each part of the general structure is validated in this thesis. The experiments cover detection rates, clustering goodness, time, memory usage and the complexity of the algorithms. The accuracy of the clustering and of the summarization of clusters using GMMs is evaluated and compared to that of other methods. Using the Davies-Bouldin (DB) and Dunn indexes, the distance between original clusters and clusters regenerated using GMMs is almost zero with the SGMM method, while this value for ABACUS is around 0.01. Moreover, the results show that the SGMM algorithm is 3 times faster than ABACUS in running time, using one-third of the memory used by ABACUS.

The CPL method, used to label new instances, is found to collectively remove the effect of noise, while increasing the accuracy of labeling new instances. In a noisy environment, the detection rate of the CPL method is 5% higher than that of other algorithms such as one-class SVM. The false alarm rate is decreased by 10% on average. Memory use is 20 times lower than that of the one-class SVM.

The proposed method is found to lower the false alarm rate, which is one of the basic problems for the one-class SVM. Experiments show that, with the two-layer structure, the false alarm rate is decreased by 5% to 15% and the detection rate is increased by 5% to 10%, depending on the dataset. The memory usage for the two-layer structure is 20 to 50 times less than that of one-class SVM. One-class SVM uses support vectors in labeling new instances, while the labeling of the two-layer structure depends on the number of GMMs. The experiments show that the two-layer structure is 20 to 50 times faster than the one-class SVM in labelin...


“…Other variants have been proposed for categorical attributes or a mixture of categorical and continuous attributes. Otey et al. defined the anomaly score as the inverse of the sum of the link strengths between the instance and the other instances in the data set [8]. The associated link strength is equal to the number of attribute-value pairs shared between two instances.…”

Section: A Definition Of Anomaly Score (mentioning)

confidence: 99%
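
The score described in the statement above can be illustrated with a minimal sketch. It follows only the wording of the citing review (link strength as the count of shared attribute-value pairs, score as the inverse of the summed link strengths) and handles categorical attributes only; it is not the original paper's full algorithm. The function names, the toy records, and the choice to return infinity when an instance shares nothing with any other are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence

Record = Sequence[Hashable]


def link_strength(a: Record, b: Record) -> int:
    """Count attribute-value pairs shared by two records (same attribute, same value)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def anomaly_scores(records: List[Record]) -> List[float]:
    """Score each record as the inverse of its summed link strength to all other records."""
    scores = []
    for i, record in enumerate(records):
        total = sum(
            link_strength(record, other)
            for j, other in enumerate(records)
            if j != i
        )
        # Assumption: a record with no shared pairs gets an infinite score.
        scores.append(float("inf") if total == 0 else 1.0 / total)
    return scores


# Hypothetical categorical records: (protocol, service, flag).
records = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("tcp", "smtp", "SF"),
    ("udp", "dns", "REJ"),
]
scores = anomaly_scores(records)
most_anomalous = scores.index(max(scores))
```

In this toy data the last record shares no attribute-value pair with any other record, so it receives the highest score, while the two identical records receive the lowest.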

“…Otey et al. presented a tunable algorithm for distributed anomaly detection in mixed-attribute data sets [8]. They capture the links between the points in the mixed categorical and continuous attribute space.…”

Section: B Distance/similarity Measure (mentioning)

confidence: 99%

A Review of Anomaly Detection Techniques Based on Nearest Neighbor

Zhao

Chen

Li

2018

Advances in Intelligent Systems Research


Abstract: The concept of nearest neighbor has been used in several anomaly detection techniques, which assume that normal data instances occur in dense neighborhoods and that anomalies lie far from their closest neighbors. These techniques therefore require a distance or similarity measure defined between two data instances. Researchers have extended the basic technique in three ways: modifying the definition of the anomaly score, selecting a different distance or density measure for each data type, and reducing the computational complexity. In this paper we attempt to provide an overview of the previous work, although it is limited.


“…Statistics-based approaches (see [2,3]) were first used for outlier detection based on an assumption that the distributions of datasets are known. A data point was defined as an outlier if it deviates from the existing distribution.…”

Section: Introduction (mentioning)

confidence: 99%

Rank-based outlier detection

Huang

Mehrotra

Mohan

2013

Journal of Statistical Computation and Simulation


ABSTRACT: We propose a new approach for outlier detection, based on a new ranking measure that focuses on the question of whether a point is "important" for its nearest neighbors; in our notation, a low cumulative rank implies the point is central. For instance, a point centrally located in a cluster has a relatively low cumulative sum of ranks because it is among the nearest neighbors of its own nearest neighbors. But a point at the periphery of a cluster has a high cumulative sum of ranks because its nearest neighbors are closer to the points. Use of ranks eliminates the problem of density calculation in the neighborhood of the point and this improves performance. Our method performs better than several density-based methods, on some synthetic data sets as well as on some real data sets.
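
The cumulative-rank idea in this abstract can be sketched roughly as follows. This is an illustrative reading of the abstract only, not the authors' published procedure: for each point, it sums the rank that the point holds in the distance-ordered neighbor list of each of its own k nearest neighbors. The function name, the use of Euclidean distance, the value of k, and the toy points are all assumptions.

```python
import math
from typing import List, Sequence


def rank_sum_scores(points: List[Sequence[float]], k: int) -> List[int]:
    """For each point p, sum the rank of p as seen from each of p's k nearest neighbors."""
    n = len(points)
    # order[q] holds the indices of all other points, sorted by distance from q.
    order = []
    for q in range(n):
        others = [j for j in range(n) if j != q]
        others.sort(key=lambda j: math.dist(points[q], points[j]))
        order.append(others)

    scores = []
    for p in range(n):
        neighbors = order[p][:k]
        # Rank 1 means p is the closest point to q; larger ranks mean p matters less to q.
        scores.append(sum(order[q].index(p) + 1 for q in neighbors))
    return scores


# Hypothetical 2-D data: four clustered points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = rank_sum_scores(points, k=2)
outlier_index = scores.index(max(scores))
```

A point inside the cluster ranks high in its neighbors' lists and so accumulates a low sum, while the distant point is ranked last by each of its nearest neighbors and accumulates the largest sum. No density estimate is computed at any step, which matches the motivation stated in the abstract.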

