Abstract: With the emergence of the big data age, how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performe…
“…2) Random Forest Algorithm: The Random Forest algorithm is an ensemble classifier that uses 'bagging' to create multiple decision trees and classifies each new incoming data instance into a class or group [11]. The trees are fully grown, not pruned [12].…”
Section: B. Algorithms, 1) K-Nearest Neighbors Algorithm: the K-nearest…
mentioning
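The bagging procedure described in the snippet above can be sketched in plain Python. The one-feature threshold "stump" stands in for a full unpruned decision tree, and all names (`train_stump`, `random_forest_predict`, the toy data) are illustrative, not from the cited paper:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # "Bagging" = bootstrap aggregating: draw n instances with replacement.
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    # Hypothetical stand-in for an unpruned decision tree: a single
    # threshold rule fitted to the bootstrap sample.
    thr = sum(x for x, _ in sample) / len(sample)
    maj = lambda ys: Counter(ys).most_common(1)[0][0] if ys else 0
    p = maj([y for x, y in sample if x >= thr])  # majority label above thr
    n = maj([y for x, y in sample if x < thr])   # majority label below thr
    return lambda x: p if x >= thr else n

def random_forest_predict(data, x, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]
    votes = Counter(tree(x) for tree in trees)   # each tree votes
    return votes.most_common(1)[0][0]            # majority class wins

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
print(random_forest_predict(data, 0.95))  # → 1
```

Because each tree sees a different bootstrap sample, individual trees disagree, but the majority vote is more stable than any single tree.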
Abstract—This study evaluates the impact of three data types (text only, numeric only, and text + numeric) on the performance of three classifiers: Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB). The classification problems are explored in terms of mean accuracy and the effects of varying algorithm parameters across different types of datasets. Eight datasets taken from the UCI repository were used to train models for all three algorithms. The results clearly show that RF and kNN outperform NB. Furthermore, kNN and RF achieve roughly the same mean accuracy, but kNN takes less time to train a model. Changing the number of attributes in a dataset has no effect on Random Forest, whereas the mean accuracy of Naïve Bayes fluctuates and ends lower, and that of kNN increases and ends higher. Additionally, changing the number of trees has no significant effect on the mean accuracy of Random Forest, although the time to train the model increases greatly. Random Forest and k-Nearest Neighbor prove to be the best classifiers for any type of dataset, though Naïve Bayes can outperform the other two algorithms when the feature variables are independent. Among the three, Random Forest takes the highest computational time and Naïve Bayes the lowest. k-Nearest Neighbor requires finding an optimal value of k for improved performance, at the cost of computation time. Likewise, changing the number of attributes affects the performance of Naïve Bayes and k-Nearest Neighbor but not Random Forest. This study can be extended by researchers who use parametric methods to analyze results.
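The kNN behaviour the abstract discusses (majority vote among the k nearest instances, with k as a tunable parameter) can be sketched as follows; the one-dimensional toy data and function name are hypothetical, not from the study:

```python
from collections import Counter

def knn_predict(train, x, k=3):
    # Classify x by majority label among its k nearest neighbours
    # (absolute distance in one dimension, for brevity).
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [(0.1, "a"), (0.2, "a"), (0.4, "a"), (0.7, "b"), (0.8, "b"), (0.9, "b")]
print(knn_predict(train, 0.75, k=3))  # → b
```

The sort over the whole training set on every query is why the abstract notes that tuning k for better accuracy comes at the cost of computation time.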
“…The hyperspectral image data acquired by the imaging spectrometer cannot be classified and analyzed directly; it must first be processed. Therefore, the preprocessing of hyperspectral remote sensing images generally includes atmospheric radiation correction, geometry correction, and noise removal [45,55-57]. Among these preprocessing steps, radiometric correction is the main one.…”
The purposes of the algorithm presented in this paper are to select the features with the highest average separability, using the random forest method, for categories that are easy to distinguish, and to select the most separable features for the most difficult categories using the weighted entropy algorithm. The framework is composed of five parts: (1) random sample selection; (2) initial random forest classification with probabilistic output based on the number of votes; (3) semisupervised classification, an improvement of the supervised random forest classification based on the weighted entropy algorithm; (4) precision evaluation; and (5) a comparison with the traditional minimum distance classification and support vector machine (SVM) classification. To verify the universality of the proposed algorithm, two different data sources are tested: AVIRIS and Hyperion data. The results show that for the AVIRIS data the overall classification accuracy reaches 87.36%, the kappa coefficient reaches 0.8591, and the classification time is 22.72 s; for the Hyperion data, the overall accuracy reaches 99.17%, the kappa coefficient reaches 0.9904, and the classification time is 8.16 s. Compared with the minimum distance, SVM, and CART classifiers, both classification accuracy and efficiency are greatly improved.
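Step (2) of the framework, turning the forest's votes into a probabilistic output, can be sketched as below. Plain Shannon entropy is used here as a stand-in for the paper's weighted entropy variant, and the class names and vote counts are invented for illustration:

```python
import math
from collections import Counter

def vote_probabilities(votes):
    # Probabilistic output of a forest: p(class) = votes for class / total trees.
    total = len(votes)
    return {c: n / total for c, n in Counter(votes).items()}

def entropy(probs):
    # Shannon entropy of the vote distribution; high entropy marks pixels
    # the forest is unsure about, which are natural candidates for the
    # semisupervised refinement step.
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

votes = ["water"] * 18 + ["soil"] * 2        # 20 trees, an 18-2 split
probs = vote_probabilities(votes)
print(round(probs["water"], 2), round(entropy(probs), 3))  # → 0.9 0.469
```

A unanimous vote gives entropy 0, while an even split maximizes it, so ranking pixels by this value separates confident from uncertain classifications.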
“…The bat algorithm, which combines the major advantages of particle swarm optimization and the genetic algorithm, together with Harmony Search, is applied to yield optimal parameters in the DBN. Second, random forest is suitable for handling large data owing to its parallelization [28]. It has been combined with Spark [28], a heuristic bootstrap sampling method [29], kernel principal component analysis [30], and other technologies to perform fault diagnosis and regression tasks [31,32].…”
Section: Mathematical Problems In Engineering
mentioning
confidence: 99%
“…Second, random forest is suitable for handling large data owing to its parallelization [28]. It has been combined with Spark [28], a heuristic bootstrap sampling method [29], kernel principal component analysis [30], and other technologies to perform fault diagnosis and regression tasks [31,32]. To improve forecasting accuracy on high-dimensional and large-scale wind turbine data, we propose an optimized random forest method that consists of a dimension-reduction procedure and a weighted voting process for short-term WPF.…”
Section: Mathematical Problems In Engineering
mentioning
confidence: 99%
“…Compute the final prediction result from the prediction values of all regression trees according to (28), and then update the error weight of each regression tree in real time according to (30)-(31).…”
Section: The Short-term Wind Power Forecasting Model
mentioning
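Since equations (28) and (30)-(31) are not reproduced in this excerpt, the weighted aggregation step can only be sketched under an assumed scheme; inverse-absolute-error weights and all names below are assumptions for illustration, not the paper's exact formulas:

```python
def weighted_forecast(predictions, weights):
    # Combine per-tree predictions as a weighted average, so trees with
    # larger error weights contribute more to the final forecast.
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

def update_weights(predictions, actual, eps=1e-9):
    # Re-weight each regression tree by the inverse of its latest absolute
    # error, so recently accurate trees dominate the next forecast.
    return [1.0 / (abs(p - actual) + eps) for p in predictions]

preds = [10.0, 12.0, 11.0]            # outputs of three regression trees
w = [1.0, 1.0, 1.0]                   # start from equal weights
print(weighted_forecast(preds, w))    # → 11.0
w = update_weights(preds, actual=11.0)  # tree 3 was exact, so it gains weight
```

Updating the weights after every observed value lets the ensemble track drift in the wind data without retraining the trees themselves.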
A variety of supervised learning methods using numerical weather prediction (NWP) data have been exploited for short-term wind power forecasting (WPF). However, NWP data may not be sufficiently available owing to uncertainties in the initial atmospheric conditions. This study therefore proposes a novel hybrid intelligent method to improve on existing forecasting models, such as random forests (RF) and artificial neural networks, for higher accuracy. First, the proposed method develops a predictive deep belief network (DBN) to perform short-term wind speed prediction (WSP); the WSP data are then transformed into supplementary input features for the WPF prediction process. Second, owing to its ensemble learning and parallelization, the random forest is used as the supervised forecasting model. In addition, a data-driven dimension-reduction procedure and a weighted voting method are used to optimize the random forest algorithm in the training process and the prediction process, respectively. Because an increasing number of training samples can cause overfitting, the k-fold cross validation (CV) technique is adopted to address this issue. Numerical experiments at the 15-min, 30-min, 45-min, and 24-h horizons demonstrate the method's advantages over existing approaches in terms of forecasting accuracy and scalability.
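The k-fold cross validation step mentioned in the abstract can be sketched generically; `fit` and `score` are caller-supplied placeholders, and the toy mean-predictor model below is purely illustrative, not the paper's forecasting model:

```python
def k_fold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds for cross validation.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cross_validate(data, k, fit, score):
    # Train on k-1 folds, score on the held-out fold, average the k scores;
    # every sample is used for validation exactly once, which is what
    # guards against overfitting to any single training split.
    folds = k_fold_indices(len(data), k)
    results = []
    for held_out in folds:
        test = [data[i] for i in held_out]
        train = [d for i, d in enumerate(data) if i not in held_out]
        results.append(score(fit(train), test))
    return sum(results) / k

# Toy model: predict the training mean; score = negative mean absolute error.
fit = lambda train: sum(train) / len(train)
score = lambda model, test: -sum(abs(model - t) for t in test) / len(test)
print(round(cross_validate([1.0, 2.0, 3.0, 4.0], 2, fit, score), 2))  # → -2.0
```

For time-series data such as wind power, a sliding or forward-chaining split is often preferred over the contiguous folds shown here, so that the model never trains on the future.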