Abstract. The curse of dimensionality is a well known but not entirely well-understood phenomena. Too much data, in terms of the number of input variables, is not always a good thing. This is especially true when the problem involves unsupervised learning or supervised learning with unbalanced data (many negative observations but minimal positive observations). This paper addresses two issues involving high dimensional data: The first issue explores the behavior of kernels in high dimensional data. It is shown that variance, especially when contributed by meaningless noisy variables, confounds learning methods. The second part of this paper illustrates methods to overcome dimensionality problems with unsupervised learning utilizing subspace models. The modeling approach involves novelty detection with the one-class SVM.
IntroductionHigh dimensional data often create problems. This problem is exacerbated if the training data is only one class, unknown classes, or significantly unbalanced classes. Consider a binary classification problem that involves computer intrusion detection. Our intention is to classify network traffic, and we are interested in classifying the traffic as either attacks (intruders) or non attacks. Capturing network traffic is simple -hookup to a LAN cable, run tcpdump, and you can fill a hard drive within minutes. These captured network connections can be described with attributes; it is not uncommon for a network connection to be described with over 100 attributes [14]. However, the class of each connection will be unknown, or perhaps with reasonable confidence we can assume that all of the connections do not involve any attacks. The above scenario can be generalized to other security problems as well. Given a matrix of data, X, containing N observations and m attributes, we are interested in classifying this data as either potential attackers (positive class) or non attackers (negative class). If m is large, and our labels, y ∈ R N ×1 , are unbalanced (usually plenty of known non attackers and few instances of attacks), one class (all non attackers), or unknown, increased dimensionality rapidly becomes a problem and feature selection is not feasible due to the minimal examples (if any) of the attacker class.