Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing
DOI: 10.1109/ipps.1998.669983
ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets

Abstract: In this paper, we present ScalParC (Scalable Parallel Classifier), a new parallel formulation of a decision-tree-based classification process. Like other state-of-the-art decision tree classifiers such as SPRINT, ScalParC is suited for handling large datasets. We show that the existing parallel formulation of SPRINT is unscalable, whereas ScalParC is shown to be scalable in both runtime and memory requirements. We present the experimental results of classifying up to 6.4 million records on up to 128 processors of …
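To make the classification process concrete, the sketch below shows the serial per-node split evaluation that decision tree classifiers such as SPRINT and ScalParC parallelise: a single sorted scan of a numerical attribute, scoring each candidate threshold by weighted Gini impurity. This is a minimal illustrative sketch under assumed names (gini, best_numeric_split), not code from the paper.

# Minimal serial sketch of the split evaluation that tree classifiers such
# as SPRINT and ScalParC parallelise. All names are illustrative, not taken
# from the paper.
from collections import Counter

def gini(counts, total):
    """Gini impurity of a class-count distribution."""
    return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

def best_numeric_split(values, labels):
    """Scan one sorted numerical attribute, tracking class counts on each
    side of the candidate split point, and return the split with the
    lowest weighted Gini impurity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    n = len(values)
    best = (float("inf"), None)  # (impurity, threshold)
    for pos, i in enumerate(order[:-1]):
        left[labels[i]] += 1
        right[labels[i]] -= 1
        n_left = pos + 1
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best[0]:
            threshold = (values[i] + values[order[pos + 1]]) / 2
            best = (score, threshold)
    return best

# Example: split a toy attribute against binary class labels.
impurity, threshold = best_numeric_split([2.0, 7.5, 3.1, 9.4], [0, 1, 0, 1])
print(impurity, threshold)  # 0.0 5.3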

Cited by 95 publications (79 citation statements)
References 4 publications
“…[1] is a MapReduce algorithm which builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. In [11], ScalParC employs a distributed hash table to implement the splitting phase for classification problems. The approach of [9] uses sampling to achieve memory-efficient processing of numerical attributes for Gini impurity in the classification tree.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
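As context for the snippet above: the hash table in question maps record ids to the decision-tree node each record currently belongs to, so attribute lists can be split consistently without re-sorting. The sketch below illustrates the idea serially; in ScalParC the table is partitioned across processors and updated through collective communication, which is omitted here, and all names are illustrative assumptions.

# Serial sketch of the hash-table splitting idea attributed to ScalParC [11]:
# a table maps each record id to its current tree node. In the real
# algorithm this table is distributed across processors; the names below
# are illustrative assumptions.

def split_node(node_of, attr_values, node, threshold, left_child, right_child):
    """Reassign every record currently at `node` to a child node of the
    winning split on this attribute (value <= threshold goes left)."""
    for rid, value in attr_values.items():
        if node_of.get(rid) == node:
            node_of[rid] = left_child if value <= threshold else right_child

# All four records start at the root node (id 0).
node_of = {rid: 0 for rid in range(4)}
attr_values = {0: 2.0, 1: 7.5, 2: 3.1, 3: 9.4}
split_node(node_of, attr_values, node=0, threshold=5.3, left_child=1, right_child=2)
print(node_of)  # {0: 1, 1: 2, 2: 1, 3: 2}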
“…Considering the parallelisation methodology, related works can be categorised as follows: 1) exploiting attribute parallelism by partitioning the training set by columns and then adopting data parallelism [11,18]; 2) exploiting node parallelism by independently building different nodes or sub-trees, adopting task parallelism [6]; 3) combining the two above in various fashions [20,23]. Several works focus on distributed-memory machines, including SPRINT [18], ScalParC [11], pCLOUDS [20], and the approach of [6].…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
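A minimal sketch of category 1) above, attribute (column) parallelism: each worker scans the attribute columns assigned to it for the locally best split, and a reduction selects the global winner. Threads stand in for distributed-memory processors here, and every name is an illustrative assumption rather than code from the cited systems.

# Hedged sketch of attribute (column) parallelism: the training set is
# partitioned by columns, each worker scans its own attribute for the best
# split, and a min-reduction picks the global winner. Illustrative only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def gini(counts, total):
    return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

def scan_attribute(name, values, labels):
    """One worker's job: a single sorted scan of one attribute column."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    n, best = len(values), (float("inf"), name, None)
    for pos, i in enumerate(order[:-1]):
        left[labels[i]] += 1
        right[labels[i]] -= 1
        nl = pos + 1
        score = (nl * gini(left, nl) + (n - nl) * gini(right, n - nl)) / n
        if score < best[0]:
            best = (score, name, (values[i] + values[order[pos + 1]]) / 2)
    return best

def global_best_split(columns, labels):
    """columns: attribute name -> list of values (one column per worker).
    In a distributed-memory setting the min() below would be a cross-
    processor reduction (e.g. an MPI allreduce); threads stand in here."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(scan_attribute, n, v, labels)
                   for n, v in columns.items()]
        return min(f.result() for f in futures)

cols = {"age": [23, 41, 35, 52], "income": [30.0, 80.0, 45.0, 95.0]}
print(global_best_split(cols, labels=[0, 1, 0, 1]))  # (0.0, 'age', 38.0)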
“…Several works focus on distributed-memory machines, including SPRINT [18], ScalParC [11], pCLOUDS [20], and the approach of [6]. Scalable data structures and efficient load-balancing techniques that minimise costly data redistribution operations are the most important factors in obtaining good performance.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…Much work has been done on the parallelization of classification algorithms [30,33,35,38]. The algorithm for defect categorization that we parallelize is very different from the algorithms considered in these efforts, and therefore the parallelization issues are quite different.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)