Mining Very Large Databases with Parallel Processing

Freitas, Alex A.; Lavington, Simon

doi:10.1007/978-1-4615-5521-6

Cited by 83 publications

(81 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…The point is that the processing time taken per small disjunct is relatively short even when using a genetic algorithm, since there are just a few examples in the training set of a small disjunct. Finally, if necessary the processing time taken by all the c * d GA runs can be considerably reduced by using parallel processing techniques [5]. Actually, our method greatly facilitates the exploitation of parallelism in the discovery of small disjunct rules, since each GA run is completely independent from the others and it needs to have access only to a small data set, which surely can be kept in the local memory of a simple processor node.…”

Section: Computational Resultsmentioning

confidence: 99%

A genetic-algorithm for discovering small-disjunct rules in data mining

Carvalho

Freitas

2002

Applied Soft Computing

Self Cite

View full text Add to dashboard Cite

This paper addresses the well-known classification task of data mining, where the goal is to discover rules predicting the class of examples (records of a given data set). In the context of data mining, small disjuncts are rules covering a small number of examples. Hence, these rules are usually error-prone, which contributes to a decrease in predictive accuracy. At first glance, this is not a serious problem, since the impact on predictive accuracy should be small. However, although each small disjunct covers few examples, the set of all small disjuncts can cover a large number of examples. This paper presents evidence that this is the case in several data sets. This paper also addresses the problem of small disjuncts by using a hybrid decision-tree/genetic algorithm approach. In essence, examples belonging to large disjuncts are classified by rules produced by a decision-tree algorithm (C4.5), while examples belonging to small disjuncts are classified by a genetic algorithm specifically designed for discovering small-disjunct rules. We present results comparing the predictive accuracy of this hybrid system with the prediction accuracy of three versions of C4.5 alone in eight public domain data sets. Overall, the results show that our hybrid system achieves better predictive accuracy than all three versions of C4.5 alone.

show abstract

Section: Computational Resultsmentioning

confidence: 99%

A genetic-algorithm for discovering small-disjunct rules in data mining

Carvalho

Freitas

2002

Applied Soft Computing

Self Cite

View full text Add to dashboard Cite

show abstract

“…One possibility is to use parallel processing techniques, since EAs can be easily parallelized in an effective way (Cantu-Paz 2000; Freitas & Lavington 1998;Freitas 2002a). Another possibility is to compute the fitness of individuals by using only a subset of training instances -where that subset can be chosen either at random or using adaptive instance-selection techniques (Bhattacharyya 1998;Gathercole & Ross 1997;Sharpe & Glover 1999;Freitas 2002a).…”

Section: Discussionmentioning

confidence: 99%

A Review of evolutionary Algorithms for Data Mining

Freitas

2008

Soft Computing for Knowledge Discovery and Data Mining

Self Cite

View full text Add to dashboard Cite

Summary. Evolutionary Algorithms (EAs) are stochastic search algorithms inspired by the process of neo-Darwinian evolution. The motivation for applying EAs to data mining is that they are robust, adaptive search techniques that perform a global search in the solution space. This chapter first presents a brief overview of EAs, focusing mainly on two kinds of EAs, viz. Genetic Algorithms (GAs) and Genetic Programming (GP). Then the chapter reviews the main concepts and principles used by EAs designed for solving several data mining tasks, namely: discovery of classification rules, clustering, attribute selection and attribute construction. Finally, it discusses Multi-Objective EAs, based on the concept of Pareto dominance, and their use in several data mining tasks.

show abstract

“…Dividing data by features [6] requires the workers to coordinate which input data instance falls into which tree-node. This requires additional communication, which we try to avoid as we scale to very large data sets.…”

Section: Related Workmentioning

confidence: 99%

“…It builds the robust regression tree on the master by exactly calculating the robust loss functions in a distributed way. SRT: It refers to the distributed regression tree based on square error criteria [17] in Apache Spark machine learning tool set 6 . Prior to the tree induction, a pre-processing step is performed to obtain static and equidepth histograms for each feature and the split points are constantly selected from the bins of such histograms in the training phase.…”

Section: Setupmentioning

confidence: 99%

Efficient Distributed Decision Trees for Robust Regression

Guo

Kutzkov

Ahmed

et al. 2016

Machine Learning and Knowledge Discovery in Databases

View full text Add to dashboard Cite

Abstract. The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In numerous real-world applications large datasets are inevitably noisy and contain outliers. These outliers can dramatically degrade the performance of standard machine learning approaches such as regression trees. To this end, we present a novel distributed regression tree approach that utilizes robust regression statistics, statistics that are more robust to outliers, for handling large and noisy data. We propose to integrate robust statistics based error criteria into the regression tree. A data summarization method is developed and used to improve the efficiency of learning regression trees in the distributed setting. We implemented the proposed approach and baselines based on Apache Spark, a popular distributed data processing platform. Extensive experiments on both synthetic and real datasets verify the effectiveness and efficiency of our approach.

show abstract

Mining Very Large Databases with Parallel Processing

Cited by 83 publications

References 0 publications

A genetic-algorithm for discovering small-disjunct rules in data mining

A genetic-algorithm for discovering small-disjunct rules in data mining

A Review of evolutionary Algorithms for Data Mining

Efficient Distributed Decision Trees for Robust Regression

Contact Info

Product

Resources

About