Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016
DOI: 10.1145/2939672.2939779

Overcoming Key Weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilarity Measure

Cited by 49 publications (46 citation statements)
References 19 publications
“…First, mass-based dissimilarity measures 36,37 have been shown to outperform distance measures using the same NN algorithms in classification, clustering, anomaly detection, and information retrieval tasks.…”
Section: Discussion
confidence: 99%
“…Previous works [7,28] have shown that random partitions of data can be used to compute a similarity between the instances. In particular, in Unsupervised Extremely Randomized Trees (UET), the idea is that all instances ending up in the same leaves are more similar to each other than to other instances.…”
Section: Methods
confidence: 99%
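The tree-partition similarity described in the excerpt above can be sketched in a few lines. This is a minimal illustration, not the UET implementation of [7]: scikit-learn's RandomTreesEmbedding (totally random trees) stands in for the random partitioner, and the helper name tree_similarity is ours.

```python
# Sketch: similarity from random tree partitions (UET-style idea).
# Assumption: RandomTreesEmbedding is a stand-in partitioner, not the UET code.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

def tree_similarity(X, n_trees=100, max_depth=5, seed=0):
    """Pairwise similarity = fraction of trees in which two instances share a leaf."""
    forest = RandomTreesEmbedding(n_estimators=n_trees, max_depth=max_depth,
                                  random_state=seed).fit(X)
    Z = forest.transform(X)                  # one-hot leaf membership, one block per tree
    return (Z @ Z.T).toarray() / n_trees     # shared-leaf counts, averaged over trees

X = np.random.RandomState(0).rand(50, 4)
S = tree_similarity(X)
print(S.shape, S[0, 0])                      # (50, 50) 1.0 -- a point shares every leaf with itself
```

Instances that end up in the same leaves across many trees score close to 1; instances that are almost always separated score close to 0.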
“…The intuition behind our proposed method, GT, is to leverage a similar partition in the vertices of a graph. Instead of using the similarity computation that we described previously, we chose to use the mass-based approach introduced by Ting et al. [28]. The key property of their measure is that the dissimilarity between two instances in a dense region is higher than the same interpoint dissimilarity between two instances in a sparse region of the same space.…”
Section: Methods
confidence: 99%
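The data-dependent property quoted above can also be illustrated directly. The sketch below is an assumption-laden illustration, not the implementation of Ting et al. [28]: RandomTreesEmbedding again stands in for the random (iForest-style) partitioning, and fit_partitions / mass_dissimilarity are hypothetical helper names. The dissimilarity of a pair is estimated as the average, over trees, of the fraction of data falling in the deepest node that contains both points.

```python
# Sketch of a mass-based dissimilarity in the spirit of Ting et al. [28].
# Assumption: RandomTreesEmbedding provides the random partitions; this is an
# illustration, not the reference implementation.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

def fit_partitions(X, n_trees=200, max_depth=10, seed=0):
    return RandomTreesEmbedding(n_estimators=n_trees, max_depth=max_depth,
                                random_state=seed).fit(X)

def mass_dissimilarity(forest, X, i, j):
    """Average over trees of the data mass of the deepest node containing points i and j."""
    n, masses = len(X), []
    for tree in forest.estimators_:
        path = tree.decision_path(X[[i, j]]).toarray().astype(bool)
        shared = np.flatnonzero(path[0] & path[1])   # nodes on both root-to-leaf paths
        deepest = shared.max()                       # child ids always exceed parent ids
        masses.append(tree.tree_.n_node_samples[deepest] / n)
    return float(np.mean(masses))

# Toy demo: two same-sized square regions, one dense and one sparse, plus two
# probe pairs with the same interpoint distance. The dense pair typically gets a
# HIGHER mass-based dissimilarity than the sparse pair, matching the quoted property.
rng = np.random.RandomState(0)
dense  = rng.uniform(0, 1, size=(500, 2))            # dense unit square
sparse = rng.uniform(5, 6, size=(25, 2))             # sparse unit square
probes = np.array([[0.3, 0.5], [0.7, 0.5],           # pair inside the dense square
                   [5.3, 5.5], [5.7, 5.5]])          # pair inside the sparse square
X = np.vstack([dense, sparse, probes])
forest = fit_partitions(X)
k = len(dense) + len(sparse)
print(mass_dissimilarity(forest, X, k, k + 1))       # dense pair
print(mass_dissimilarity(forest, X, k + 2, k + 3))   # sparse pair
```

A geometric distance would treat both probe pairs identically; the mass-based score separates them because a covering region of the same size holds far more data in the dense square.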
“…They are sampled from the data distribution of each class. In consequence, the trees still learn an abstraction of the data, acting as a density estimator [46].…”
Section: How To Learn a Proximity Forest?
confidence: 99%