Stephen D. Bay scite author profile

Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
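The idea in this abstract (a nested loop over randomly ordered data, abandoned early by a pruning rule) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the scoring function, so the sketch assumes an outlier's score is its distance to its k-th nearest neighbor, and the names `top_outliers`, `k`, `n_outliers`, and `seed` are hypothetical.

```python
import heapq
import math
import random


def top_outliers(points, k=3, n_outliers=2, seed=0):
    """Return (score, point) pairs for the n_outliers points with the
    largest k-th-nearest-neighbor distance (an assumed outlier score)."""
    data = list(points)
    random.Random(seed).shuffle(data)  # random order makes early pruning likely

    top = []      # min-heap of (score, index): best outliers found so far
    cutoff = 0.0  # weakest score in `top`; stays 0 until `top` is full

    for i, p in enumerate(data):
        nearest = []  # negated distances: a max-heap of the k closest so far
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the k-th neighbor distance can only shrink as
            # more neighbors are seen, so once it falls below the cutoff
            # this point can no longer reach the top list.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]

    return sorted(((s, data[i]) for s, i in top), reverse=True)


# Usage: a 5x5 grid of ordinary points plus two distant points.
grid = [(x, y) for x in range(5) for y in range(5)]
result = top_outliers(grid + [(100, 100), (-50, 60)], k=3, n_outliers=2)
```

In this sketch most grid points are discarded after only a few distance computations once the cutoff is established, which is the behavior the abstract credits for the near linear scaling.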


The UCI KDD archive of large data sets for data mining research and experimentation

Bay, Kibler, Pazzani, et al. 2000

SIGKDD Explor. Newsl.

203

106


Advances in data collection and storage have allowed organizations to create massive, complex and heterogeneous databases, which have stymied traditional methods of data analysis. This has led to the development of new analytical tools that often combine techniques from a variety of fields such as statistics, computer science, and mathematics to extract meaningful knowledge from the data. To support research in this area, UC Irvine has created the UCI Knowledge Discovery in Databases (KDD) Archive


Mining distance-based outliers in near linear time with randomization and a simple pruning rule

2003



Detecting change in categorical data

Bay, Pazzani 1999

143


A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 versus 1998. We present the problem of mining contrast-sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide an algorithm for mining contrast-sets as well as several pruning rules to reduce the computational complexity. Once the deviations are found, we post-process the results to present a subset that are surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
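As a rough illustration of the notion of a contrast set and of a Bonferroni correction, the sketch below tests whether one conjunction of attribute values is distributed differently across two groups. It is not the algorithm from the paper: the use of Pearson's chi-square test on a 2x2 table is an assumption (the abstract names only the Bonferroni correction), the search and pruning steps are omitted, and the names `support_counts`, `chi2_p_value_2x2`, `is_contrast_set`, and `n_tests` are hypothetical.

```python
import math


def support_counts(rows, groups, itemset):
    """Count, per group, the rows matching every attribute=value pair."""
    counts, totals = {}, {}
    for row, g in zip(rows, groups):
        totals[g] = totals.get(g, 0) + 1
        counts.setdefault(g, 0)
        if all(row.get(attr) == val for attr, val in itemset.items()):
            counts[g] += 1
    return counts, totals


def chi2_p_value_2x2(a, b, c, d):
    """P-value of Pearson's chi-square test on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a degenerate table carries no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def is_contrast_set(rows, groups, itemset, n_tests, alpha=0.05):
    """True if the itemset's support differs across exactly two groups at a
    Bonferroni-adjusted level of alpha / n_tests."""
    counts, totals = support_counts(rows, groups, itemset)
    g1, g2 = sorted(totals)  # this sketch handles exactly two groups
    a, b = counts[g1], counts[g2]
    c, d = totals[g1] - a, totals[g2] - b
    return chi2_p_value_2x2(a, b, c, d) < alpha / n_tests


# Usage: 80% of group A but only 20% of group B match {"major": "cs"}.
rows = ([{"major": "cs"}] * 80 + [{"major": "bio"}] * 20
        + [{"major": "cs"}] * 20 + [{"major": "bio"}] * 80)
groups = ["A"] * 100 + ["B"] * 100
found = is_contrast_set(rows, groups, {"major": "cs"}, n_tests=1000)
```

Dividing `alpha` by the number of tests performed is what bounds the chance of any false positive across the whole analysis, at the cost of requiring stronger evidence for each individual candidate.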


Nearest neighbor classification from multiple feature subsets

Bay 1999

IDA

