A parallel C4.5 decision tree algorithm based on MapReduce

Self Cite

Fuzzy decision trees are one of the most important extensions of decision trees for symbolic knowledge acquisition by fuzzy representation. Many fuzzy decision trees employ fuzzy information gain as a measure to construct the tree node splitting criteria. These criteria play a critical role in the construction of decision trees. However, many of the criteria work well only on small-scale or medium-scale data sets and cannot directly deal with large-scale data sets on account of limiting factors such as memory capacity, execution time, and data complexity. Parallel computing is one way to overcome these problems; in particular, MapReduce is one mainstream solution for parallel computing. In this paper, we design a parallel tree node splitting criterion (MR-NSC) based on fuzzy information gain via MapReduce, which is completely equivalent to the traditional non-parallel splitting rule. The experimental studies verify the equivalence between the proposed MR-NSC algorithm and the traditional non-parallel approach on 22 UCI benchmark data sets. Furthermore, the feasibility and parallelism are also studied on two large-scale data sets.

KEYWORDS: fuzzy decision trees, fuzzy information gain, MapReduce, parallel computing

INTRODUCTION

Decision trees are one of the best-known approaches for describing a decision-making process in light of existing knowledge. Each branch of a decision tree can be transformed into a decision rule, and all these decision rules can generate a decision rule base. The popularity of decision trees arises mainly from the fact that the decision rules are more readily comprehensible than some other decision-making models such as neural networks. 1,2 Based on decision trees, there are many extensions 3: most of them are extensions or improvements of the well-known ID3 algorithm 4 and CART algorithm. 5 Fuzzy decision trees are one of the most popular extensions. They combine the symbolic decision trees with approximate reasoning offered by fuzzy representation. 6 The intent is to exploit the complementary advantages of the comprehensibility of decision trees and the uncertain information of fuzzy representation. 7,8

Based on some splitting mechanisms, fuzzy decision trees recursively partition the training data into several subsets with similar or identical outputs in a top-down way. In particular, as one of the splitting mechanisms, fuzzy information theory has had a widespread influence on the growth of fuzzy decision trees. Weber 9 presented a well-known fuzzy Iterative Dichotomiser 3 (ID3) algorithm by modifying the information gain measure used to split a tree node for fuzzy representation. Umanol et al 10 proposed a new algorithm based on the probability of membership values to generate a fuzzy decision tree from numerical data, in which the fuzzy sets for each attribute are predefined by users. Ichihashi et al 11 realized fuzzy partitions by extracting fuzzy reasoning rules. An algebraic method to facilitate incremental learning is also employed. As knowledge inferences must be newly defined in fuz...

Section: The Mechanics Of MapReduce

mentioning

confidence: 99%

A parallel tree node splitting criterion for fuzzy decision trees

Liu

Wang

et al. 2019

Self Cite

“…Then, these intermediate outputs are combined in the Reduce phase via a reduce function by some way to export the final results. Figure presents the detailed processing procedure of Map‐Reduce framework …”

Section: Related Work

mentioning

confidence: 99%

“…When the best splitting attribute $A_{k^{*}}$ and the splitting point $c_{k^{*}}$ are confirmed by Algorithm 4, we need to split the data set into subsets in parallel for large‐scale data sets, which can follow the MR‐D‐S algorithm. In the following, how to construct an MR‐FRMIDT is introduced.…”

Section: The Parallel Fast Rank Mutual Information Based Decision Tree

mentioning

confidence: 99%

“…54 Figure 1 presents the detailed processing procedure of Map-Reduce framework. 55 As shown in Figure 1, both Map and Reduce phases use < key, value > pairs as the inputs and the outputs of two functions. In the Map phase, the Map function takes a single < key, value > pair as its input and outputs a list of intermediate < key, value > pairs.…”

Section: The Framework Of Map-Reduce

mentioning

confidence: 99%
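
The <key, value> flow described in the statement above can be illustrated with a minimal single-process sketch in Python. It is not the code of any cited paper and does not use Hadoop; the names `run_job`, `count_map`, and `count_reduce`, as well as the toy records, are assumptions made for illustration. The example counts (attribute value, class label) pairs, the kind of tally a tree node splitting criterion might aggregate.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to each input <key, value> pair and yield the
    intermediate <key, value> pairs it emits."""
    for key, value in records:
        yield from map_fn(key, value)


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Combine the list of values collected for each key into one final result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_job(records, map_fn, reduce_fn):
    """Hypothetical driver: map, then shuffle, then reduce, all in one process."""
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


def count_map(record_id, record):
    """Emit one intermediate pair per record: ((attribute value, label), 1)."""
    attribute_value, label = record
    return [((attribute_value, label), 1)]


def count_reduce(key, values):
    """Sum the partial counts for one (attribute value, label) key."""
    return sum(values)


# Toy input: record id as key, (attribute value, class label) as value.
records = [
    (0, ("sunny", "no")),
    (1, ("sunny", "yes")),
    (2, ("rain", "yes")),
    (3, ("sunny", "no")),
]
counts = run_job(records, count_map, count_reduce)
```

In a real deployment the map and reduce functions would run on separate workers and the grouping step would be handled by the framework; only the two user-supplied functions correspond to what the quoted passage calls the Map and Reduce functions.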


A fast rank mutual information based decision tree and its implementation via Map‐Reduce

Wang

Liu

2018

Self Cite

Summary: To address the time‐consuming confirmation of splitting attributes and splitting points in classic rank mutual information based decision trees, this paper establishes a fast rank mutual information based decision tree (FRMIDT) for classification problems. First, the proposed FRMIDT algorithm improves speed by using a max‐relevance and min‐redundancy criterion to remove the redundant attributes when building each tree node. Then, the fuzzy c‐means algorithm is employed to confirm the splitting points for further acceleration. Meanwhile, a parallel implementation is developed in the framework of Map‐Reduce (MR‐FRMIDT) for medium or large‐scale data classification. Several comparative studies are conducted on UCI benchmark data sets. In contrast to the classic rank mutual information based decision tree on 12 data sets, the proposed FRMIDT model effectively reduces the computational time while maintaining testing accuracy. Furthermore, the proposed FRMIDT algorithm is comparable to other traditional decision tree classifiers including BFT, C4.5, LAD, NBT, and SC. Meanwhile, the comparison with monotonic decision trees based on 7 different popular splitting measures on several data sets illustrates the effectiveness of FRMIDT in monotonic classification. Finally, the experimental analysis on 6 other data sets shows that the proposed MR‐FRMIDT is feasible and has good parallel performance in reducing execution time and avoiding memory restrictions.

“…4,18,19 A lot of work is available in the past, where the MapReduce-Hadoop has been optimized. 21 The major MapReduce scheduling algorithms, such as FIFO, 22 Matchmaking and Delay, 23 and Multithreading locality (MTL), 24 improves the efficiency of MapReduce processing on virtualized infrastructure. 20 A parallel MapReduce version of the serial C4.5 decision tree learning algorithm (MR-C4.5) proves high speed-up and scalability.…”

mentioning

confidence: 99%

An improved integrated Grid and MapReduce‐Hadoop architecture for spatial data: Hilbert TGS R‐Tree–based IGSIM

Singh

Bawa

2019

Variegated distributed computing technologies have been used in recent years of revolutionary phase for efficiently and logically planned spatial data analysis. Grid computing and MapReduce technologies have provided a prodigious technological furtherance in the Geographic Information System (GIS) domain. The Grid is known for its high computing and the MapReduce implementation, Hadoop, is known for its data analytics. A lot of research exists to show that the Grid and MapReduce complement each other when integrated. In our earlier work, a novel architecture, Integrated Grid and Spatially Indexed MapReduce (IGSIM), was proposed that integrates Grid and SpatialHadoop for fast spatial queries. The R-Tree and the R*-Tree spatial indexes of SpatialHadoop were exploited for fast data access in the IGSIM. However, the efficiency of spatial queries can be enhanced further by employing a better spatial indexing algorithm. In this paper, a thorough literature survey has been done on the available traditional spatial indexes from the serial programming environment, and Hilbert TGS R-Tree has been selected on the basis of several parameters for its parallel implementation and for extending the spatial query efficiency work of the IGSIM. The improved architecture is named Hilbert TGS R-Tree-based IGSIM. The experimental results demonstrate high efficiency of the proposed work.

KEYWORDS: grid, Hadoop, MapReduce, spatial index, spatial query

1 INTRODUCTION

The processing of spatial data using distributed computing has given a new magnitude to voluminous and complex spatial data. A vast number of applications have benefited from advancement in distributed computing technologies. Distributed and parallel processing accelerates processing and eventually improves efficiency. GIS applications show a slow response in the conventional GIS software. The web provided an interoperable platform for distributed GIS data and services.
However, the stateless nature of the web became a bottleneck in its wide adoption. Grid computing and MapReduce technologies have become conveniently accessible in dealing with spatial data and related queries efficiently. Grid-GIS and Hadoop-GIS are practical implementations of using GIS on the Grid and MapReduce, respectively. The Grid-GIS provides high computing environment, whereas the Hadoop-GIS provides high data analytic environment. The fault tolerance is an internal process in the Hadoop and its implementation in the Hadoop is relatively easier than in the Grid. 1 The Hadoop provides high scalability and fault tolerance on spatial data set and queries. 2 In a research paper, the authors have experimentally shown that accuracy of outputs generated for processed spatial data in Hadoop is similar to that generated using the GIS software ArcGIS, 3 and so it encourages using parallelization framework-Hadoop for spatial data analysis. The pros and cons of MapReduce-Hadoop have been critically examined in the work of Dean and Ghemawat. 4 Similarly, the developments in the field of Grid-GIS and cons and pros of G...