2012
DOI: 10.1002/sam.11161

A survey on unsupervised outlier detection in high‐dimensional numerical data

Abstract: High‐dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term ‘curse of dimensionality’, more concrete aspects being the so‐called ‘distance concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In about just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high‐dimensional data in Euclidean space. […]

Cited by 685 publications (453 citation statements)
References 138 publications
“…We are aware that these artificial data cubes are not real simulations of Earth system data cubes. However, relying on artificial data in this paper is motivated by the fact that a meaningful quantitative evaluation of unsupervised anomaly detection algorithms and feature extraction techniques in real Earth observation data is difficult due to the lack of ground-truth data (Zimek et al., 2012). Second, we use these artificial data to evaluate the capability of different algorithms to detect multivariate anomalous events, including compound events (e.g., events in which none of the single variables are extreme, but their joint distribution is anomalous and might lead to an extreme impact) (Seneviratne et al., 2012; Leonard et al., 2013).…”
Section: Introduction (mentioning)
Confidence: 99%
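The compound-event idea quoted above lends itself to a small illustration. The following Python sketch is our own, not taken from the cited papers; the 2-D correlated Gaussian setup and all variable names are illustrative assumptions. It constructs a point that is unremarkable in each variable separately but contradicts the correlation structure of the data, so a joint measure such as the Mahalanobis distance flags it while per-variable z-scores do not.

```python
# Illustrative sketch (not from the cited papers): a "compound event" is a
# point whose individual coordinates look ordinary but whose combination is
# anomalous. With correlated variables, per-variable z-scores miss it,
# while the joint Mahalanobis distance exposes it.
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated variables (think temperature and soil moisture).
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10_000)

# Compound anomaly: +1.5 sigma on one axis, -1.5 sigma on the other.
# Neither coordinate is extreme on its own, but the pair contradicts
# the positive correlation of the bulk of the data.
x = np.array([1.5, -1.5])

mu = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))

z_scores = np.abs((x - mu) / data.std(axis=0))  # marginal (per-variable) view
d = x - mu
mahalanobis = np.sqrt(d @ inv_cov @ d)          # joint view

print("per-variable |z|:", z_scores)        # roughly [1.5, 1.5] -- unremarkable
print("Mahalanobis distance:", mahalanobis)  # ~6.7 -- clearly anomalous
```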
“…Column 6 (IPFIX IE) shows the IPFIX information elements that can be used to generate this feature or explains which additional steps (comparison of IEs or state keeping of connections) are necessary. [Each feature is a] variable, thus vectors become sparse and dissimilar in the huge universe and the exploration by classifiers becomes harder. This problem has a strong effect on some classification techniques, as well as on clustering and anomaly detection in general (Zimek et al., 2012). The performance degradation caused by irrelevant or redundant features varies depending on the classification technique.…”
Section: Problems of High-Dimensionality for Classification (mentioning)
Confidence: 99%
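The sparsity and dissimilarity problem mentioned in this excerpt is the distance concentration effect surveyed by Zimek et al. (2012). A minimal Python sketch (our own illustration, assuming i.i.d. uniform features; the sample sizes are arbitrary) shows how the relative contrast between the nearest and farthest neighbor of a query point shrinks as dimensionality grows:

```python
# Minimal sketch of the "distance concentration effect": as dimensionality
# grows, the relative contrast between the nearest and farthest neighbor
# of a query point vanishes for i.i.d. features.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of uniformly drawn points

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(n, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast (Dmax-Dmin)/Dmin = {contrast:.3f}")

# Typical output: the contrast drops from well above 1 in 2-D to a small
# fraction in 1000-D, so "nearest" and "farthest" become nearly
# indistinguishable and distance-based outlier scores lose discrimination.
```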
“…We assume that, in expectation, the Mahalanobis distance δ_k(x_i) becomes a constant factor, which makes the Fisher score independent per dimension. This assumption is based on the concentration-of-distances theorem [31], which states that for high-dimensional data the proportional distance difference between any point and the mean of all data points vanishes. Intuitively, this theorem states that the distance differences δ_k(x_i), for k = {1, …”
Section: Analytical Approximation of the Fisher Information Matrix (mentioning)
Confidence: 99%
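For reference, a common formal statement of the concentration-of-distances result, in the style of Beyer et al.; this is the generic formulation, not necessarily the exact wording of reference [31] in the citing paper:

```latex
% Standard formulation of distance concentration (assumption: the generic
% theorem, not a verbatim copy of reference [31]).
% If the relative spread of distances vanishes, the nearest and farthest
% neighbors of a query point become indistinguishable in probability.
\[
  \lim_{d \to \infty}
  \operatorname{Var}\!\left( \frac{\lVert X_d \rVert}{\mathbb{E}\left[\lVert X_d \rVert\right]} \right) = 0
  \quad \Longrightarrow \quad
  \frac{D_{\max}^{(d)} - D_{\min}^{(d)}}{D_{\min}^{(d)}}
  \;\xrightarrow{\;P\;}\; 0,
\]
where $D_{\max}^{(d)}$ and $D_{\min}^{(d)}$ denote the distances from a
query point to its farthest and nearest neighbor in dimension $d$.
```

Under this result, the assumption quoted above, that δ_k(x_i) is effectively a constant factor in expectation, follows because all points come to sit at nearly the same distance from the data mean.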