Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing
DOI: 10.1109/ipps.1998.669983
ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets

Abstract: In this paper, we present ScalParC (Scalable Parallel Classifier), a new parallel formulation of a decision-tree-based classification process. Like other state-of-the-art decision tree classifiers such as SPRINT, ScalParC is suited for handling large datasets. We show that the existing parallel formulation of SPRINT is unscalable, whereas ScalParC is shown to be scalable in both runtime and memory requirements. We present the experimental results of classifying up to 6.4 million records on up to 128 processors of …
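To make the classification process concrete, the sketch below shows the serial per-node split evaluation that decision tree classifiers such as SPRINT and ScalParC parallelise: a single sorted scan of a numerical attribute, scoring each candidate threshold by weighted Gini impurity. This is a minimal illustrative sketch under assumed names (gini, best_numeric_split), not code from the paper.

# Minimal serial sketch of the split evaluation that tree classifiers such
# as SPRINT and ScalParC parallelise. All names are illustrative, not taken
# from the paper.
from collections import Counter

def gini(counts, total):
    """Gini impurity of a class-count distribution."""
    return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

def best_numeric_split(values, labels):
    """Scan one sorted numerical attribute, tracking class counts on each
    side of the candidate split point, and return the split with the
    lowest weighted Gini impurity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    n = len(values)
    best = (float("inf"), None)  # (impurity, threshold)
    for pos, i in enumerate(order[:-1]):
        left[labels[i]] += 1
        right[labels[i]] -= 1
        n_left = pos + 1
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best[0]:
            threshold = (values[i] + values[order[pos + 1]]) / 2
            best = (score, threshold)
    return best

# Example: split a toy attribute against binary class labels.
impurity, threshold = best_numeric_split([2.0, 7.5, 3.1, 9.4], [0, 1, 0, 1])
print(impurity, threshold)  # 0.0 5.3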

Cited by 95 publications (79 citation statements)
References 4 publications
“…[1] is a MapReduce algorithm which builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. In [11], ScalParC employs a distributed hash table to implement the splitting phase for classification problems. The approach of [9] uses sampling to achieve memory-efficient processing of numerical attributes for Gini impurity in the classification tree.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
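As context for the snippet above: the hash table in question maps record ids to the decision-tree node each record currently belongs to, so attribute lists can be split consistently without re-sorting. The sketch below illustrates the idea serially; in ScalParC the table is partitioned across processors and updated through collective communication, which is omitted here, and all names are illustrative assumptions.

# Serial sketch of the hash-table splitting idea attributed to ScalParC [11]:
# a table maps each record id to its current tree node. In the real
# algorithm this table is distributed across processors; the names below
# are illustrative assumptions.

def split_node(node_of, attr_values, node, threshold, left_child, right_child):
    """Reassign every record currently at `node` to a child node of the
    winning split on this attribute (value <= threshold goes left)."""
    for rid, value in attr_values.items():
        if node_of.get(rid) == node:
            node_of[rid] = left_child if value <= threshold else right_child

# All four records start at the root node (id 0).
node_of = {rid: 0 for rid in range(4)}
attr_values = {0: 2.0, 1: 7.5, 2: 3.1, 3: 9.4}
split_node(node_of, attr_values, node=0, threshold=5.3, left_child=1, right_child=2)
print(node_of)  # {0: 1, 1: 2, 2: 1, 3: 2}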
“…Considering the parallelisation methodology, related works can be categorised as follows: 1) exploiting attribute parallelism by partitioning the training set by columns and then adopting data parallelism [11,18]; 2) exploiting node parallelism by independently building different nodes or sub-trees, adopting task parallelism [6]; 3) combining the two above in various fashions [20,23]. Several works focus on distributed-memory machines, including SPRINT [18], ScalParC [11], pCLOUDS [20], and the approach of [6].…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
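A minimal sketch of category 1) above, attribute (column) parallelism: each worker scans the attribute columns assigned to it for the locally best split, and a reduction selects the global winner. Threads stand in for distributed-memory processors here, and every name is an illustrative assumption rather than code from the cited systems.

# Hedged sketch of attribute (column) parallelism: the training set is
# partitioned by columns, each worker scans its own attribute for the best
# split, and a min-reduction picks the global winner. Illustrative only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def gini(counts, total):
    return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

def scan_attribute(name, values, labels):
    """One worker's job: a single sorted scan of one attribute column."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    n, best = len(values), (float("inf"), name, None)
    for pos, i in enumerate(order[:-1]):
        left[labels[i]] += 1
        right[labels[i]] -= 1
        nl = pos + 1
        score = (nl * gini(left, nl) + (n - nl) * gini(right, n - nl)) / n
        if score < best[0]:
            best = (score, name, (values[i] + values[order[pos + 1]]) / 2)
    return best

def global_best_split(columns, labels):
    """columns: attribute name -> list of values (one column per worker).
    In a distributed-memory setting the min() below would be a cross-
    processor reduction (e.g. an MPI allreduce); threads stand in here."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(scan_attribute, n, v, labels)
                   for n, v in columns.items()]
        return min(f.result() for f in futures)

cols = {"age": [23, 41, 35, 52], "income": [30.0, 80.0, 45.0, 95.0]}
print(global_best_split(cols, labels=[0, 1, 0, 1]))  # (0.0, 'age', 38.0)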
“…Several works focus on distributed-memory machines, including SPRINT [18], ScalParC [11], pCLOUDS [20], and the approach of [6]. Scalable data structures and efficient load-balancing techniques that minimise costly data redistribution operations are the most important factors in obtaining good performance.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…Much work has been done on the parallelization of classification algorithms [30,33,35,38]. The algorithm for defect categorization that we parallelize is very different from the algorithms considered in these efforts, and therefore the parallelization issues are quite different.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)