Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Bay, Stephen D.; Schwabacher, Mark

doi:10.1145/956755.956758

Cited by 195 publications

(102 citation statements)

References 0 publications

“…This gives a summary classification of all existing detection techniques. -We demonstrate a huge improvement in execution time by using multiple pruning rules in two phases, compared with outstanding existing nested-loop distance-based methods, ORCA [11] and RBRP [12]. Since ORCA, RBRP and MIRO use the same notion of outlier (Section 2), outliers identified by the three techniques are exactly the same.…”

Section: Introduction (mentioning)

confidence: 93%

“…Here two pruning rules are utilized: a) first triangular inequality on the data point's outlier score is used, and then b) the outlier score is compared with the minimum score required to be an outlier. The second check is similar to that of ORCA [11]. However, while ORCA starts with a cutoff of 0, in MIRO the initial cutoff is obtained from the first phase, and hence converges faster.…”

Section: Introduction (mentioning)

confidence: 93%
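The cutoff-based pruning described in the statement above can be sketched as follows. This is a minimal illustration, not the cited implementation: it assumes the outlier score is the distance to the k-th nearest neighbour, uses Euclidean distance, and omits block processing. The function name `top_n_outliers` and its parameters are hypothetical.

```python
import heapq
import math
import random


def top_n_outliers(points, k, n, seed=0):
    """Return up to n (score, point) pairs with the largest k-th-NN distance."""
    data = list(points)
    random.Random(seed).shuffle(data)  # random scan order makes early pruning likely
    cutoff = 0.0  # score of the weakest outlier kept so far; starts at 0
    top = []      # min-heap of (score, point), at most n entries
    for i, p in enumerate(data):
        nearest = []  # max-heap (negated) of the k smallest distances seen so far
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The current k-th smallest distance can only shrink, so it is an
            # upper bound on p's final score. Below the cutoff, p is a non-outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, p))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, p))
        if len(top) == n:
            cutoff = top[0][0]  # cutoff rises as stronger outliers are found
    return sorted(top, reverse=True)
```

Because the cutoff only rises, most points in a dense region are discarded after a handful of distance computations, which is where the savings over a plain nested loop come from.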

“…Among these, approaches for pruning the outlier search space and distance computation reduction techniques are dominant. Computation reduction approaches [7,12,11,6] usually fix the desired number of outliers to a certain value (e.g., top n outliers), and deploy data structures similar to those used in Ramaswamy's index-based algorithm.…”

Section: Related Work (mentioning)

confidence: 99%

“…This leads to high execution times and has motivated many attempts to produce efficient algorithms to mine outliers. Among them, outstanding work by Bay and Schwabacher [11] and Ghoting et al [12] aim to reduce execution time by utilizing a simple pruning nested-loop algorithm.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this work, we employ a global outlier function based on [6], although the ideas employed in MIRO can also be adapted to use other functions. The intuition and quality of detection results of the chosen outlier definition are based on solid foundations as shown by prior work [6,11]. This definition is also employed in other popular techniques on outlier detection [12].…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Pruning Schemes for Distance-Based Outlier Detection

Gopalkrishnan

2009

Lecture Notes in Computer Science

Abstract. Outlier detection finds many applications, especially in domains that have scope for abnormal behavior. In this paper, we present a new technique for detecting distance-based outliers, aimed at reducing execution time associated with the detection process. Our approach operates in two phases and employs three pruning rules. In the first phase, we partition the data into clusters, and make an early estimate on the lower bound of outlier scores. Based on this lower bound, the second phase then processes relevant clusters using the traditional block nested-loop algorithm. Here two efficient pruning rules are utilized to quickly discard more non-outliers and reduce the search space. Detailed analysis of our approach shows that the additional overhead of the first phase is offset by the reduction in cost of the second phase. We also demonstrate the superiority of our approach over existing distance-based outlier detection methods by extensive empirical studies on real datasets.
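The effect of seeding the nested loop with a lower bound, as the two-phase design in this abstract does, can be sketched as below. This is an assumption-laden illustration: the abstract does not detail its clustering step, so the sketch swaps in a plain random sample to obtain the initial cutoff, and it scores points by k-th nearest-neighbour distance. All names (`knn_score`, `scan`, `sample_cutoff`) are hypothetical, and `n` is assumed to be at least 1.

```python
import bisect
import math
import random


def knn_score(i, data, k, cutoff):
    """Return (score, n_dist) for data[i]; score is None if pruned by cutoff."""
    nearest = []  # ascending list of the k smallest distances seen so far
    n_dist = 0
    for j, q in enumerate(data):
        if j == i:
            continue
        n_dist += 1
        bisect.insort(nearest, math.dist(data[i], q))
        del nearest[k:]  # keep only the k smallest
        if len(nearest) == k and nearest[-1] < cutoff:
            return None, n_dist  # cannot reach the cutoff any more
    return (nearest[-1] if len(nearest) == k else None), n_dist


def scan(data, k, n, cutoff=0.0):
    """Nested-loop scan; returns (top-n [(score, index)], distance count)."""
    top = []  # ascending list of the best (score, index) pairs, at most n
    total = 0
    for i in range(len(data)):
        score, n_dist = knn_score(i, data, k, cutoff)
        total += n_dist
        if score is None:
            continue
        bisect.insort(top, (score, i))
        del top[:-n]  # keep the n largest
        if len(top) == n:
            cutoff = max(cutoff, top[0][0])
    return top[::-1], total


def sample_cutoff(data, k, n, sample_size, seed=0):
    """Assumed stand-in for phase one: the n-th largest exact score in a sample.

    Any n exact scores bound the n-th largest score of the full data from
    below, so this is a safe initial cutoff (requires sample_size >= n).
    """
    idx = random.Random(seed).sample(range(len(data)), sample_size)
    scores = sorted(knn_score(i, data, k, 0.0)[0] for i in idx)
    return scores[-n]
```

Calling `scan(data, k, n, sample_cutoff(data, k, n, sample_size))` returns the same top-n set as `scan(data, k, n)`, while the second value lets the two scans be compared by number of distance computations. The count excludes the cost of the sample itself, which a real first phase would have to offset.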

On the effectiveness of isolation‐based anomaly detection in cloud data centers

Calheiros, Ramamohanarao, Buyya, et al.

2017

Concurrency and Computation

Summary. The high volume of monitoring information generated by large‐scale cloud infrastructures poses a challenge to the capacity of cloud providers in detecting anomalies in the infrastructure. Traditional anomaly detection methods are resource‐intensive and computationally complex for training and/or detection, which is undesirable in very dynamic and large‐scale environments such as clouds. Isolation‐based methods have the advantage of low complexity for training and detection and are optimized for detecting failures. In this work, we explore the feasibility of Isolation Forest, an isolation‐based anomaly detection method, to detect anomalies in large‐scale cloud data centers. We propose a method to code time‐series information as extra attributes that enable temporal anomaly detection and establish its feasibility to adapt to seasonality and trends in the time‐series and to be applied online and in real‐time.

Building a scientific workflow framework to enable real‐time machine learning and visualization

Song

2018

Concurrency and Computation

Summary. Nowadays, we have entered the era of big data. In the area of high performance computing, large‐scale simulations can generate huge amounts of data with potentially critical information. However, these data are usually saved in intermediate files and are not instantly visible until advanced data analytics techniques are applied after reading all simulation data from persistent storage (e.g., local disks or a parallel file system). This approach puts users in a situation where they spend a long time waiting on running simulations without knowing the status of the running job. In this paper, we build a new computational framework to couple scientific simulations with multi‐step machine learning processes and in‐situ data visualizations. We also design a new scalable simulation‐time clustering algorithm to automatically detect fluid flow anomalies. This computational framework is built upon different software components and provides plug‐in data analysis and visualization functions over complex scientific workflows. With this advanced framework, users can monitor and get real‐time notifications of special patterns or anomalies from ongoing extreme‐scale turbulent flow simulations.
