David A. Cieslak scite author profile

David A. Cieslak

5 Publications

496 Citation Statements Received

76 Citation Statements Given

Affiliations

ThermoAnalytics (United States), University of Notre Dame

Publications

Learning Decision Trees for Unbalanced Data

Abstract. Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria (information gain, Gini measure, and DKM) and identifies a new skew-insensitive measure in Hellinger distance. We outline the strengths of Hellinger distance in class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis across the decision tree construction methods. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction of the splitting metric and sampling. We evaluate over a wide range of datasets and determine which methods operate best under class imbalance.

Hellinger distance decision trees are robust and skew-insensitive

Cieslak, Hoens, Chawla, et al. 2011. Data Min Knowl Disc

204

114

Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
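As a rough illustration of the splitting criterion described in this abstract, the following minimal Python sketch computes a Hellinger distance for a candidate split in a two-class problem. It assumes the commonly stated form of the criterion, which compares how the positive class and the negative class are each distributed across the branches of the split. The function name `hellinger_split_value` and the example counts are hypothetical and are not taken from the authors' software.

```python
import math


def hellinger_split_value(pos_counts, neg_counts):
    """Hellinger distance between the class-conditional branch distributions.

    pos_counts[j] and neg_counts[j] are the numbers of positive and negative
    examples that fall into branch j of a candidate split (hypothetical
    interface, chosen for illustration only).
    """
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    if total_pos == 0 or total_neg == 0:
        # With only one class present there is nothing to separate.
        return 0.0
    return math.sqrt(sum(
        (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
        for p, n in zip(pos_counts, neg_counts)
    ))


# The same split scored on a balanced sample and on a sample where the
# negative class is 100 times larger but spread over the branches identically.
balanced = hellinger_split_value([8, 2], [3, 7])
skewed = hellinger_split_value([8, 2], [300, 700])
```

Because each class is normalized by its own total, scaling the size of one class leaves the value unchanged in this sketch, which is one way to read the skew insensitivity claimed in the abstract. A split that separates the classes perfectly reaches the maximum value of the square root of 2.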

Automatically countering imbalance and its empirical relationship to cost

Chawla, Cieslak, Hall, et al. 2008. Data Min Knowl Disc

213

Learning from imbalanced datasets presents a convoluted problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of resampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and cost-dependent f-measure. Our analysis of the wrapper is two-fold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compare the performance of the wrapper approach against cost-sensitive learning methods (MetaCost and Cost-Sensitive Classifiers) and find that the wrapper outperforms cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtain the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.

Combating imbalance in network intrusion datasets

A Robust Decision Tree Algorithm for Imbalanced Data Sets

et al. 2010
