Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To the best of our knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, yielding an algorithm with linear time complexity, a low constant factor, and a low memory requirement. Our empirical evaluation shows that iForest compares favourably to ORCA (a near-linear time complexity distance-based method), LOF and Random Forests in terms of AUC and processing time, especially on large data sets. iForest also works well in high-dimensional problems that have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.
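To illustrate the isolation idea described in this abstract, the following is a minimal sketch (not the paper's exact algorithm; the function name and the toy data are ours): a single random tree isolates a point through random axis-parallel splits, and anomalies tend to require far fewer splits than normal points.

```python
import random

def isolation_path_length(x, data, rng, depth=0, max_depth=50):
    """Follow random axis-parallel splits until `x` is isolated from `data`,
    returning the number of splits used. Anomalies yield short paths."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(x))                 # pick a random attribute
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:                                # cannot split further
        return depth
    split = rng.uniform(lo, hi)                 # pick a random split value
    # keep only the points on the same side of the split as x
    side = [p for p in data if (p[dim] < split) == (x[dim] < split)]
    return isolation_path_length(x, side, rng, depth + 1, max_depth)

rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
anomaly = (6.0, 6.0)
paths = [isolation_path_length(anomaly, normal + [anomaly], random.Random(i))
         for i in range(100)]
print(sum(paths) / len(paths))  # short average path => likely anomaly
```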
Anomalies are data points that are few and different. As a result of these properties, we show that anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation, without employing any distance or density measure; this is fundamentally different from all existing methods. As a result, iForest is able to exploit subsampling (i) to achieve a low linear time complexity and a small memory requirement, and (ii) to deal effectively with the effects of swamping and masking. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF and Random Forests in terms of AUC and processing time, and that it is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in the training sample.
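The subsampling described here is exposed directly in scikit-learn's implementation of the method. A brief usage sketch (the synthetic data and parameter values are illustrative assumptions, not from the paper):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(size=(1000, 2))                    # mostly normal points
X_test = np.vstack([rng.normal(size=(10, 2)),           # normal test points
                    rng.uniform(4, 6, size=(10, 2))])   # injected anomalies

# max_samples is the subsample size: each tree is built on a small random
# subsample, keeping training near-linear and countering swamping/masking.
clf = IsolationForest(n_estimators=100, max_samples=256, random_state=42)
clf.fit(X_train)
scores = clf.score_samples(X_test)  # lower scores => more anomalous
print(scores)
```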
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We find that the best results are obtained when the higher-level model combines the confidences, and not just the predictions, of the lower-level ones. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.
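A sketch of the key finding, using scikit-learn stand-ins (the paper's own level-0 and level-1 learners differ; the dataset and estimator choices below are illustrative assumptions): passing class-membership confidences rather than crisp predictions to the level-1 generalizer is a single parameter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three different types of level-0 learners, echoing the paper's setting.
level0 = [("tree", DecisionTreeClassifier(random_state=0)),
          ("nb", GaussianNB()),
          ("knn", KNeighborsClassifier())]

# stack_method="predict_proba" feeds confidences (not just predictions)
# to the level-1 generalizer, the choice the paper found to work best.
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba", cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())
```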
MetaCost is a recently proposed procedure that converts an error-based learning algorithm into a cost-sensitive algorithm. This paper investigates two important issues centered on the procedure that were ignored in the paper proposing MetaCost. First, no comparison was made between MetaCost's final model and the internal cost-sensitive classifier on which MetaCost depends. It is plausible that the internal cost-sensitive classifier may outperform the final model, without the additional computation required to derive the final model. Second, MetaCost assumes that its internal cost-sensitive classifier is obtained by applying a minimum expected cost criterion. It is unclear whether violation of this assumption affects MetaCost's performance. We study these issues using two boosting procedures, and compare against the original form of MetaCost, which employs bagging.
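For readers unfamiliar with the minimum expected cost criterion that MetaCost assumes, a minimal sketch (the function name, cost matrix, and probabilities are illustrative assumptions):

```python
import numpy as np

def min_expected_cost_predict(proba, cost):
    """Given class-probability estimates `proba` (n_samples x n_classes)
    and a cost matrix where cost[i, j] is the cost of predicting class i
    when the true class is j, choose the class i that minimises
    sum_j P(j|x) * cost[i, j]."""
    expected = proba @ cost.T  # (n_samples, n_predicted_classes)
    return expected.argmin(axis=1)

# Example: a false negative (predicting 0 when the truth is 1) costs 5x
# a false positive, so uncertain cases get pushed towards class 1.
cost = np.array([[0.0, 5.0],
                 [1.0, 0.0]])
proba = np.array([[0.85, 0.15],   # low risk of class 1: still predict 0
                  [0.60, 0.40]])  # 40% risk: flips to class 1 under cost
print(min_expected_cost_predict(proba, cost))  # -> [0 1]
```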
A recent proposal of a data-dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity, and propose a nearest-neighbour method instead. We formally prove the characteristic of Isolation Similarity under the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of clusters, called mass-connected clusters, is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes one that detects mass-connected clusters when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected while density-connected clusters cannot.
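A sketch of the drop-in replacement described above (the helper name and all parameter values, including psi, t, and eps, are illustrative assumptions, not the paper's tuned settings): the nearest-neighbour-induced similarity is estimated from random Voronoi partitions, and DBSCAN itself is left unchanged apart from receiving the precomputed (dis)similarity.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def isolation_dissimilarity(X, psi=16, t=100, seed=0):
    """Estimate the nearest-neighbour-induced Isolation Similarity:
    build t random Voronoi partitions, each from psi sampled points;
    two points are similar in a partition if they share a cell.
    Returns 1 - similarity as a precomputed dissimilarity matrix."""
    rng = np.random.RandomState(seed)
    n = len(X)
    same = np.zeros((n, n))
    for _ in range(t):
        centres = X[rng.choice(n, size=psi, replace=False)]
        # assign every point to the cell of its nearest sampled centre
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        cell = d.argmin(axis=1)
        same += (cell[:, None] == cell[None, :])
    return 1.0 - same / t

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),    # dense cluster
               rng.normal(3, 0.9, (100, 2))])   # sparse cluster
D = isolation_dissimilarity(X)
# DBSCAN is unchanged except that it now consumes the precomputed matrix.
labels = DBSCAN(eps=0.7, min_samples=5, metric="precomputed").fit_predict(D)
print(np.unique(labels))
```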