2013 IEEE International Conference on Systems, Man, and Cybernetics
DOI: 10.1109/smc.2013.177
An Information-Theoretic Approach for Setting the Optimal Number of Decision Trees in Random Forests

Abstract: Data classification is a process within the data mining and machine learning field which aims at annotating all instances of a dataset with so-called class labels. It involves creating a model from a training set of data instances that are already labeled; this model can then also be used to assign a class to data instances that are not yet classified. A successful way of performing the classification process is provided by the Random Forests (RF) algorithm, which is itself a type of ensemble-based…

Cited by 20 publications (8 citation statements) | References 13 publications
“…To optimize the number of trees while keeping the classification accuracy close to or higher than that of the original RF algorithm, Cuzzocrea et al. proposed a new algorithm [18]. Based on the relationship between the predictive power, meaning the percentage of positively classified instances of the dataset, and the number of trees in a forest, they proposed how to optimize the number of trees in RF using an information-theoretic approach.…”
Section: Related Work
confidence: 99%
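The idea of tuning the forest size against predictive power can be sketched as follows. This is an illustrative stand-in, not the authors' information-theoretic procedure from [18]: the toy dataset, the random threshold "stumps" used in place of real trees, and the plateau threshold `eps` are all assumptions made for the sketch.

```python
import random

random.seed(0)

# Toy 1-D dataset: class 1 iff x > 0.5, with ~10% label noise.
data = [(x, (x > 0.5) != (random.random() < 0.1))
        for x in (random.random() for _ in range(400))]

def make_stump():
    """A randomized 'tree': a random threshold on x (bagging stand-in)."""
    t = random.gauss(0.5, 0.15)
    return lambda x: x > t

def ensemble_accuracy(stumps):
    """Majority-vote accuracy of the current forest on the dataset."""
    correct = 0
    for x, y in data:
        votes = sum(1 for s in stumps if s(x))
        correct += ((votes * 2 > len(stumps)) == y)
    return correct / len(data)

def pick_forest_size(max_trees=200, window=10, eps=0.002):
    """Grow the forest one tree at a time and stop once the accuracy
    gain over the last `window` added trees falls below `eps`
    (a simple plateau heuristic)."""
    stumps, history = [], []
    for t in range(1, max_trees + 1):
        stumps.append(make_stump())
        history.append(ensemble_accuracy(stumps))
        if t > window and history[-1] - history[-1 - window] < eps:
            return t, history[-1]
    return max_trees, history[-1]

n_trees, acc = pick_forest_size()
print(f"stopped at {n_trees} trees, accuracy {acc:.3f}")
```

The design point the citation highlights survives even in this toy: accuracy rises quickly with the first trees and then flattens, so a stopping rule on the accuracy-versus-size curve can cut the forest well below the default size at little cost.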
“…[pseudocode excerpt, symbols lost in extraction: for each instance in the OOB dataset, counts are accumulated from the return value of Test4Classification; two ratios are computed from these counts and passed to F Measure(·, ·)] … a node based on the best attribute is included (Line (6)). Based on the best attribute, C4.5 splits and, thus, generates […].…”
Section: OOB Dataset D
confidence: 99%
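The excerpt above appears to compute two ratios from per-instance classification counts and combine them with an F-measure. A minimal sketch of those standard formulas follows; the names `tp`/`fp`/`fn` are my assumption, since the original symbols were lost in extraction.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta score from true-positive, false-positive, and
    false-negative counts."""
    precision = tp / (tp + fp)   # fraction of positive calls that are right
    recall = tp / (tp + fn)      # fraction of actual positives found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(tp=80, fp=20, fn=10))  # precision 0.8, recall ~0.889
```

With `beta=1.0` this is the usual harmonic mean of precision and recall, matching the two-ratio-then-F-measure shape of the quoted pseudocode.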
“…Accuracy is simply measured as the probability that the algorithm predicts negative and positive instances correctly [34,35], as:…”
Section: Performance Measures
confidence: 99%
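The formula itself is truncated in the excerpt, but the definition described is presumably the standard confusion-matrix accuracy; a minimal sketch under that assumption (the count names are mine):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all instances, positive and negative alike,
    that the classifier predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=45, tn=40, fp=5, fn=10))  # -> 0.85
```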