Battiti's mutual information feature selector (MIFS) and its variant algorithms are used for many classification applications. Because they ignore feature synergy, MIFS and its variants may introduce a large bias when features cooperate in combination. Moreover, MIFS and its variants estimate feature redundancy without regard to the corresponding classification task. In this paper, we propose an automated greedy feature selection algorithm called conditional mutual information-based feature selection (CMIFS). Based on the link between interaction information and conditional mutual information, CMIFS accounts for both redundancy and synergy interactions among features and identifies discriminative features. In addition, CMIFS couples the evaluation of feature redundancy with the classification task. This decreases the probability of mistaking important features for redundant ones during the search process. The experimental results show that CMIFS can achieve higher best classification accuracy than MIFS and its variants, with the same or a smaller (by nearly 50%) number of features.

Keywords: Classification, feature selection, conditional mutual information, redundancy, interaction.

Manuscript received Apr. 20, 2010; revised June 13, 2010; accepted June 28, 2010.
I. Introduction

Feature selection plays an important role in improving the accuracy, efficiency, and scalability of the classification process. Since the relevant features are often unknown a priori in the real world, irrelevant and redundant features are introduced to represent the domain. However, more features significantly slow down the learning process and can lead to over-fitting of the classifier. With a limited amount of sample data, irrelevant features may obscure the distributions of the small set of truly relevant features and confuse the learning algorithms. It has been shown, both theoretically and empirically, that reducing the number of irrelevant or redundant features drastically increases the learning efficiency of algorithms and yields more general concepts, providing better insight into the classification task.

In supervised classification learning, one is given a training set of labeled instances. An instance is typically described as an assignment of attribute values to a set of features F, and each instance is associated with one of l possible classes in C = {c_1, …, c_l}. Feature selection can be formalized as selecting a minimum subset S from the original feature set F such that P(C|S) is as close as possible to P(C|F), where P(C|S) and P(C|F) are the conditional probability distributions approximated from the training set [1]. The minimum subset S is called an optimal subset. To find the best subset, the order of the search space is O(2^n), where n is the original number of features [2]. In practice, it is hard to search the feature subspace exhaustively because this number is huge even for medium-sized n. Many problems related to feature selection have been shown to be NP-hard [3]. Alternatively, many sequential-search-based approximation scheme...
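To make the greedy, sequential-search idea concrete, the following is a minimal sketch of a forward selection loop driven by a mutual-information criterion of the MIFS type, which at each step picks the feature f maximizing I(f; C) - beta * sum over selected s of I(f; s). The function and variable names and the value of beta are illustrative assumptions, features and the class label are assumed to be discrete, and this is not the CMIFS procedure proposed later in this paper.

import numpy as np

def mutual_information(x, y):
    """Estimate I(X; Y) in bits for two discrete 1-D arrays of equal length."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))   # joint probability estimate
            p_x = np.mean(x == xv)                  # marginal of X
            p_y = np.mean(y == yv)                  # marginal of Y
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def greedy_mi_selection(X, c, k, beta=0.5):
    """Select k column indices of X (n_samples x n_features) for class labels c."""
    n_features = X.shape[1]
    selected, remaining = [], set(range(n_features))
    # Relevance I(f; C) of each candidate feature, computed once.
    relevance = [mutual_information(X[:, f], c) for f in range(n_features)]
    while len(selected) < k and remaining:
        def score(f):
            # Redundancy is measured against already selected features only.
            redundancy = sum(mutual_information(X[:, f], X[:, s]) for s in selected)
            return relevance[f] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

With a score of this form, a feature that is individually weak but synergistic with already selected features is only ever penalized for redundancy and never rewarded for its interaction with them, which is precisely the limitation, noted in the abstract, that motivates conditioning on the class as in CMIFS.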