Abstract: With the emergence of the big data age, how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performe…
“…2) Random Forest Algorithm: The Random Forest algorithm is an ensemble classifier that uses 'bagging' to create multiple decision trees and classifies each new incoming data instance into a class or group [11]. The trees are fully grown, not pruned [12].…”
Section: B. Algorithms, 1) K-Nearest Neighbors Algorithm: the K-nearest…
mentioning
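The bagging procedure described in the snippet above can be sketched in plain Python. The one-feature threshold "stump" stands in for a full unpruned decision tree, and all names (`train_stump`, `random_forest_predict`, the toy data) are illustrative, not from the cited paper:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # "Bagging" = bootstrap aggregating: draw n instances with replacement.
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    # Hypothetical stand-in for an unpruned decision tree: a single
    # threshold rule fitted to the bootstrap sample.
    thr = sum(x for x, _ in sample) / len(sample)
    maj = lambda ys: Counter(ys).most_common(1)[0][0] if ys else 0
    p = maj([y for x, y in sample if x >= thr])  # majority label above thr
    n = maj([y for x, y in sample if x < thr])   # majority label below thr
    return lambda x: p if x >= thr else n

def random_forest_predict(data, x, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]
    votes = Counter(tree(x) for tree in trees)   # each tree votes
    return votes.most_common(1)[0][0]            # majority class wins

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
print(random_forest_predict(data, 0.95))  # → 1
```

Because each tree sees a different bootstrap sample, individual trees disagree, but the majority vote is more stable than any single tree.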
Abstract—This study evaluates the impact of three data types (text only, numeric only, and text + numeric) on the performance of three classifiers: Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB). The classification problems are explored in terms of mean accuracy and the effects of varying algorithm parameters across different types of datasets. Eight datasets taken from the UCI repository were used to train models for all three algorithms. The results clearly show that RF and kNN outperform NB. Furthermore, kNN and RF achieve roughly the same mean accuracy, but kNN takes less time to train a model. Changing the number of attributes in a dataset has no effect on Random Forest, whereas the mean accuracy of Naïve Bayes fluctuates and ends lower, and that of kNN increases and ends higher. Additionally, changing the number of trees has no significant effect on the mean accuracy of Random Forest, although the time to train the model increases greatly. Random Forest and k-Nearest Neighbor prove to be the best classifiers for any type of dataset, though Naïve Bayes can outperform the other two algorithms when the feature variables are independent. Among the three, Random Forest takes the highest computational time and Naïve Bayes the lowest. k-Nearest Neighbor requires finding an optimal value of k for improved performance, at the cost of computation time. Likewise, changing the number of attributes affects the performance of Naïve Bayes and k-Nearest Neighbor but not Random Forest. This study can be extended by researchers who use parametric methods to analyze results.
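The kNN behaviour the abstract discusses (majority vote among the k nearest instances, with k as a tunable parameter) can be sketched as follows; the one-dimensional toy data and function name are hypothetical, not from the study:

```python
from collections import Counter

def knn_predict(train, x, k=3):
    # Classify x by majority label among its k nearest neighbours
    # (absolute distance in one dimension, for brevity).
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [(0.1, "a"), (0.2, "a"), (0.4, "a"), (0.7, "b"), (0.8, "b"), (0.9, "b")]
print(knn_predict(train, 0.75, k=3))  # → b
```

The sort over the whole training set on every query is why the abstract notes that tuning k for better accuracy comes at the cost of computation time.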
“…The hyperspectral image data acquired by the imaging spectrometer cannot be classified and analyzed directly; it must first be processed. Therefore, the preprocessing of hyperspectral remote sensing images generally includes atmospheric radiation correction, geometry correction, and noise removal [45,55-57]. Among these preprocessing steps, radiometric correction is the main one.…”
The purposes of the algorithm presented in this paper are to select the features with the highest average separability, using the random forest method, for categories that are easy to distinguish, and to select the most separable features for the most difficult categories using the weighted entropy algorithm. The framework is composed of five parts: (1) random sample selection; (2) initial random forest classification with probabilistic output based on the number of votes; (3) semisupervised classification, an improvement of the supervised random forest classification based on the weighted entropy algorithm; (4) precision evaluation; and (5) a comparison with the traditional minimum distance classification and support vector machine (SVM) classification. To verify the universality of the proposed algorithm, two different data sources are tested: AVIRIS and Hyperion data. The results show that for the AVIRIS data the overall classification accuracy reaches 87.36%, the kappa coefficient reaches 0.8591, and the classification time is 22.72 s; for the Hyperion data, the overall accuracy reaches 99.17%, the kappa coefficient reaches 0.9904, and the classification time is 8.16 s. Compared with the minimum distance, SVM, and CART classifiers, both classification accuracy and efficiency are greatly improved.
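Step (2) of the framework, turning the forest's votes into a probabilistic output, can be sketched as below. Plain Shannon entropy is used here as a stand-in for the paper's weighted entropy variant, and the class names and vote counts are invented for illustration:

```python
import math
from collections import Counter

def vote_probabilities(votes):
    # Probabilistic output of a forest: p(class) = votes for class / total trees.
    total = len(votes)
    return {c: n / total for c, n in Counter(votes).items()}

def entropy(probs):
    # Shannon entropy of the vote distribution; high entropy marks pixels
    # the forest is unsure about, which are natural candidates for the
    # semisupervised refinement step.
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

votes = ["water"] * 18 + ["soil"] * 2        # 20 trees, an 18-2 split
probs = vote_probabilities(votes)
print(round(probs["water"], 2), round(entropy(probs), 3))  # → 0.9 0.469
```

A unanimous vote gives entropy 0, while an even split maximizes it, so ranking pixels by this value separates confident from uncertain classifications.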
“…The bat algorithm, which combines the major advantages of particle swarm optimization and the genetic algorithm, together with Harmony Search, is applied to yield optimal parameters in the DBN. Second, random forest is suitable for handling large data owing to its parallelization [28]. It has been combined with Spark [28], a heuristic bootstrap sampling method [29], kernel principal component analysis [30], and other technologies to perform fault diagnosis and regression tasks [31,32].…”
Section: Mathematical Problems In Engineering
mentioning
confidence: 99%
“…Second, random forest is suitable for handling large data owing to its parallelization [28]. It has been combined with Spark [28], a heuristic bootstrap sampling method [29], kernel principal component analysis [30], and other technologies to perform fault diagnosis and regression tasks [31,32]. To improve forecasting accuracy on high-dimensional and large-scale wind turbine data, we propose an optimized random forest method that consists of a dimension-reduction procedure and a weighted voting process for short-term WPF.…”
Section: Mathematical Problems In Engineering
mentioning
confidence: 99%
“…Compute the final prediction result from the prediction values of all regression trees according to (28), and then update the error weight of each regression tree in real time according to (30)-(31).…”
Section: The Short-term Wind Power Forecasting Model
mentioning
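Since equations (28) and (30)-(31) are not reproduced in this excerpt, the weighted aggregation step can only be sketched under an assumed scheme; inverse-absolute-error weights and all names below are assumptions for illustration, not the paper's exact formulas:

```python
def weighted_forecast(predictions, weights):
    # Combine per-tree predictions as a weighted average, so trees with
    # larger error weights contribute more to the final forecast.
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

def update_weights(predictions, actual, eps=1e-9):
    # Re-weight each regression tree by the inverse of its latest absolute
    # error, so recently accurate trees dominate the next forecast.
    return [1.0 / (abs(p - actual) + eps) for p in predictions]

preds = [10.0, 12.0, 11.0]            # outputs of three regression trees
w = [1.0, 1.0, 1.0]                   # start from equal weights
print(weighted_forecast(preds, w))    # → 11.0
w = update_weights(preds, actual=11.0)  # tree 3 was exact, so it gains weight
```

Updating the weights after every observed value lets the ensemble track drift in the wind data without retraining the trees themselves.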
A variety of supervised learning methods using numerical weather prediction (NWP) data have been exploited for short-term wind power forecasting (WPF). However, NWP data may not be sufficiently available owing to uncertainties in the initial atmospheric conditions. This study therefore proposes a novel hybrid intelligent method to improve on existing forecasting models, such as random forests (RF) and artificial neural networks, for higher accuracy. First, the proposed method develops a predictive deep belief network (DBN) to perform short-term wind speed prediction (WSP); the WSP data are then transformed into supplementary input features for the WPF prediction process. Second, owing to its ensemble learning and parallelization, the random forest is used as the supervised forecasting model. In addition, a data-driven dimension-reduction procedure and a weighted voting method are used to optimize the random forest algorithm in the training process and the prediction process, respectively. Because an increasing number of training samples can cause overfitting, the k-fold cross validation (CV) technique is adopted to address this issue. Numerical experiments at the 15-min, 30-min, 45-min, and 24-h horizons demonstrate the method's advantages over existing approaches in terms of forecasting accuracy and scalability.
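The k-fold cross validation step mentioned in the abstract can be sketched generically; `fit` and `score` are caller-supplied placeholders, and the toy mean-predictor model below is purely illustrative, not the paper's forecasting model:

```python
def k_fold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds for cross validation.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cross_validate(data, k, fit, score):
    # Train on k-1 folds, score on the held-out fold, average the k scores;
    # every sample is used for validation exactly once, which is what
    # guards against overfitting to any single training split.
    folds = k_fold_indices(len(data), k)
    results = []
    for held_out in folds:
        test = [data[i] for i in held_out]
        train = [d for i, d in enumerate(data) if i not in held_out]
        results.append(score(fit(train), test))
    return sum(results) / k

# Toy model: predict the training mean; score = negative mean absolute error.
fit = lambda train: sum(train) / len(train)
score = lambda model, test: -sum(abs(model - t) for t in test) / len(test)
print(round(cross_validate([1.0, 2.0, 3.0, 4.0], 2, fit, score), 2))  # → -2.0
```

For time-series data such as wind power, a sliding or forward-chaining split is often preferred over the contiguous folds shown here, so that the model never trains on the future.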