Many decision tree (DT) induction algorithms, including the popular C4.5 family, are based on the Conditional Entropy (CE) measure family. A natural question concerns the relative performance of other entropy measure families, such as Class-Attribute Mutual Information (CAMI). We therefore conducted a theoretical analysis of the CAMI family that enabled us to expose relationships with CE and to correct a previously published CAMI result. Our computational study showed only a small variation in the performance of the two families. Since feature selection is important in DT induction, we also conducted a theoretical analysis of a recently published blurring-based feature selection algorithm and developed a new feature selection algorithm. We tested this algorithm on a wider set of test problems than the comparable study in order to identify the benefits and limitations of blurring-based feature selection. These results provide theoretical and computational insight into entropy-based induction measures and feature selection algorithms.
Decision tree (DT) induction is among the more popular data mining techniques. An important component of DT induction algorithms is the splitting method, the most commonly used of which is based on the Conditional Entropy (CE) family. However, it is well known that no single splitting method gives the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting method, others are very sensitive to it. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets with nominal predictor attributes, and are competitive with the GR method for datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family should be included in data mining toolsets.
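To make the CE-based splitting criterion mentioned above concrete, the following is a minimal sketch of Quinlan's Gain Ratio for a nominal attribute: information gain (class entropy minus conditional entropy given the attribute) normalized by the attribute's own entropy. The function names and the toy data are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a sequence of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio for a nominal split attribute (Quinlan, C4.5).

    GR = (H(Y) - H(Y|X)) / H(X), where H(Y|X) is the conditional
    entropy of the class Y given attribute X, and H(X) is the split
    information (entropy of the attribute's own value distribution).
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    cond_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(values)
    if split_info == 0:  # attribute has a single value; no useful split
        return 0.0
    return (entropy(labels) - cond_entropy) / split_info

# Toy example: a balanced binary attribute that perfectly predicts the class.
x = ["a", "a", "b", "b"]
y = ["yes", "yes", "no", "no"]
print(gain_ratio(x, y))  # 1.0: full information gain, unit split information
```

A CAMI-family measure would replace the criterion computed here while leaving the surrounding induction loop unchanged, which is why the two families are directly comparable in the studies described above.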
Decision tree (DT) induction is among the more popular data mining techniques. An important component of DT induction algorithms is the splitting method, the most commonly used of which is based on the Conditional Entropy family. However, it is well known that no single splitting method gives the best performance for all problem instances. In this paper, we develop and explore hybrid splitting methods drawn from two entropy-based families: the Conditional Entropy family and another family based on the Class-Attribute Mutual Information (CAMI) measure. We compare conventional splitting methods based on single measures with hybrid splitting methods based on multiple measures. The results suggest that the hybrid methods could be competitive in terms of classification accuracy and are thus worthy of future research.