At SIGMOD 2015, an article titled “DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation” was presented and won the conference’s best paper award. In this technical correspondence, we point out some inaccuracies in the way DBSCAN was represented, and argue that the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees, not at an algorithm that can use such indexes. We also discuss the relationship between DBSCAN performance and the indexability of the dataset, and propose heuristics and indicators of badly chosen parameters to help future users of this algorithm choose parameters that yield both meaningful results and good performance. In new experiments, we show that the SIGMOD 2015 methods do not appear to offer practical benefits when the DBSCAN parameters are well chosen, and are thus primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared to the method proposed by Gan and Tao.
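The runtime argument hinges on the fact that DBSCAN issues one eps-range query per point, so the cost of the range query — not the clustering logic itself — dominates the total runtime. A minimal sketch of the algorithm, with a brute-force range query standing in for an index such as an R-tree (all names are illustrative, not from either paper):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns labels[i] = cluster id, or -1 for noise.

    Each point triggers one eps-range query; with a spatial index that
    query can cost far less than O(n), which is the crux of the runtime
    debate. Here a brute-force scan is used for simplicity.
    """
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cid = 0

    def region_query(i):               # brute force; an R-tree would accelerate this
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(region_query(i))
        if len(seeds) < min_pts:
            continue                   # noise (may later become a border point)
        labels[i] = cid
        j = 0
        while j < len(seeds):
            q = seeds[j]
            if labels[q] == -1:
                labels[q] = cid        # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                nb = region_query(q)
                if len(nb) >= min_pts:
                    seeds.extend(nb)   # q is a core point: expand the cluster
            j += 1
        cid += 1
    return labels
```

With well-chosen eps and min_pts, most range queries return small neighborhoods, which is exactly the regime in which index-accelerated DBSCAN performs well in practice.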
Abstract. We propose an original outlier detection schema that detects outliers in varying subspaces of a high dimensional feature space. In particular, for each object in the data set, we explore the axis-parallel subspace spanned by its neighbors and determine how much the object deviates from the neighbors in this subspace. In our experiments, we show that our novel subspace outlier detection is superior to existing full-dimensional approaches and scales well to high dimensional databases.
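The core idea — derive a relevant axis-parallel subspace from an object's neighbors, then measure deviation only there — can be sketched as follows. This is a simplified illustration of that idea, not the paper's exact scoring formula; the variance threshold `alpha` and all function names are assumptions for illustration:

```python
import numpy as np

def subspace_outlier_score(X, i, k=10, alpha=0.8):
    """Sketch of subspace outlier scoring: find a reference set of neighbors
    for point i, keep the attributes in which the neighbors agree (low
    variance), and measure how far point i deviates from the neighbors'
    mean in that axis-parallel subspace. Simplified, not the exact SOD model.
    """
    p = X[i]
    d = np.linalg.norm(X - p, axis=1)
    ref = np.argsort(d)[1:k + 1]                 # k nearest neighbors, excluding i
    mu = X[ref].mean(axis=0)
    var = X[ref].var(axis=0)
    subspace = var < alpha * var.mean()          # attributes where neighbors agree
    if not subspace.any():
        return 0.0
    # deviation of p from the neighbors' mean, restricted to the subspace
    return float(np.linalg.norm((p - mu)[subspace]) / subspace.sum())
```

An object that lies far from its neighbors in an attribute where those neighbors cluster tightly receives a high score, even if it looks inconspicuous in the full-dimensional space.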
Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast and, hence, are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbitrary "outlier factors" to values in the range [0,1], interpretable as the probability of a data object being an outlier. As an application, we show that this unification facilitates enhanced ensembles for outlier detection.
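One way such a translation can work is to interpret each raw score by its deviation from the mean score and map it through the Gaussian error function into [0,1]. The sketch below shows this Gaussian-scaling idea only; the paper develops a broader family of model-specific transformations, so treat this as an assumed simplification:

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """Map arbitrary raw outlier scores to [0, 1]: standardize each score
    against the mean and standard deviation of all scores, then pass it
    through erf so that large positive deviations approach 1 and average
    or below-average scores map to 0. A sketch, not the paper's full method.
    """
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Because every detector's output now lives on the same probability-like scale, scores from different models can be meaningfully averaged in an ensemble.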
Outlier detection research is currently focusing on the development of new methods and on improving the computation time for these methods. Evaluation, however, is rather heuristic, often considering just precision in the top k results or using the area under the ROC curve. These evaluation procedures do not allow for assessment of similarity between methods. Judging the similarity of, or correlation between, two rankings of outlier scores is an important question in itself, but it is also an essential step towards meaningfully building outlier detection ensembles, where this aspect has been completely ignored so far. In this study, our generalized view of evaluation methods allows us both to evaluate the performance of existing methods and to compare different methods w.r.t. their detection performance. Our new evaluation framework takes into consideration the class imbalance problem and offers new insights on similarity and redundancy of existing outlier detection methods. As a result, the design of effective ensemble methods for outlier detection is considerably enhanced.
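The two ingredients mentioned above — evaluating one ranking against ground truth and comparing two rankings with each other — can be illustrated with ROC AUC (via its rank-sum formulation) and Spearman's rank correlation. A minimal sketch, assuming no tied scores (tie handling is omitted for brevity):

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen outlier is scored above a randomly chosen inlier.
    Assumes untied scores; a full implementation would average tied ranks.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def spearman(a, b):
    """Spearman rank correlation between two score vectors: a simple
    similarity measure between the rankings of two outlier detectors."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])
```

Note that AUC depends only on ranks, so a detector's score scale is irrelevant — but AUC alone says nothing about whether two detectors rank the *same* objects highly, which is what the rank correlation captures and what matters for ensemble diversity.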
Abstract. The performance of similarity measures for search, indexing, and data mining applications tends to degrade rapidly as the dimensionality of the data increases. The effects of the so-called 'curse of dimensionality' have been studied by researchers for data sets generated according to a single data distribution. In this paper, we study the effects of this phenomenon on different similarity measures for multiply-distributed data. In particular, we assess the performance of shared-neighbor similarity measures, which are secondary similarity measures based on the rankings of data objects induced by some primary distance measure. We find that rank-based similarity measures can result in more stable performance than their associated primary distance measures.
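A common instance of such a secondary measure defines the similarity of two objects as the overlap of their k-nearest-neighbor lists under the primary distance. A minimal sketch, assuming Euclidean distance as the primary measure (one of several shared-neighbor variants, chosen here for illustration):

```python
import numpy as np

def snn_similarity(X, k=5):
    """Shared-nearest-neighbor similarity sketch: the primary measure is
    Euclidean distance; the secondary similarity of objects i and j is the
    overlap of their k-nearest-neighbor lists, normalized to [0, 1].
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]  # exclude self
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = len(knn[i] & knn[j]) / k
    return S
```

Because the secondary measure depends only on neighborhood ranks, not on raw distance values, it can remain discriminative even when high dimensionality compresses the range of the primary distances.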