Accurate Recasting of Parameter Estimation Algorithms Using Sufficient Statistics for Efficient Parallel Speed-Up

Zhang, Bin; Hsu, Meichun; Forman, George

doi:10.1007/3-540-45372-5_24

Cited by 14 publications

(17 citation statements)

References 11 publications


“…In our previous paper [ZHF00], we described a parallel decomposition for center-based clustering algorithms that limits inter-processor communication to sufficient statistics only, reducing the network bottleneck. The data set is partitioned randomly across the memory of the processors and does not need to be transferred between iterations.…”

Section: II Background (mentioning)

confidence: 99%

“…In a companion paper [ZHF00], we developed a class of parallel iterative parameter estimation algorithms, covering the center-based clustering algorithms K-Means [M67] [GG92], K-Harmonic Means [ZHD00a] [Z00b], and EM [DLR77] [MK97]. The parallelization is resource efficient and operates without approximation to the original sequential algorithms.…”

Section: Introduction (mentioning)

confidence: 99%


Distributed data clustering can be efficient and exact

Forman

Zhang

2000

SIGKDD Explor. Newsl.

Self Cite


Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. Many of these data sets are geographically distributed across multiple sites, e.g., different sales or warehouse locations. To cluster such large and distributed data sets, efficient distributed algorithms are called for to reduce the communication overhead, central storage requirements, and computation time, as well as to bring the resources of multiple machines to bear on a given problem as the data set sizes scale up. We describe a technique for parallelizing a family of center-based data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We demonstrate in this paper that even for relatively small problem sizes, it can be more cost-effective to cluster the data in place using an exact distributed algorithm than to collect the data in one central location for clustering.
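To make the idea of communicating only sufficient statistics concrete, here is a minimal sketch for one K-Means iteration. It is an illustration written for this page, not code from the cited papers: the function names (`local_stats`, `merge_and_update`, `kmeans_step`) are hypothetical, and the "processors" are simulated by looping over in-memory partitions. Each partition reports only per-cluster counts and coordinate sums; merging those gives the same centers as a single pass over all the data (up to floating-point summation order).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Stats = Tuple[List[int], List[List[float]]]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_stats(partition: Sequence[Point], centers: Sequence[Point]) -> Stats:
    """Sufficient statistics of one partition: per-cluster counts and sums."""
    dim = len(centers[0])
    counts = [0] * len(centers)
    sums = [[0.0] * dim for _ in centers]
    for point in partition:
        k = nearest(point, centers)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return counts, sums


def merge_and_update(all_stats: Sequence[Stats], centers: Sequence[Point]) -> List[Point]:
    """Combine the statistics from every partition and recompute the centers."""
    dim = len(centers[0])
    new_centers: List[Point] = []
    for k, old in enumerate(centers):
        n = sum(counts[k] for counts, _ in all_stats)
        if n == 0:
            # Assumption of this sketch: an empty cluster keeps its old center.
            new_centers.append(tuple(old))
            continue
        new_centers.append(
            tuple(sum(sums[k][d] for _, sums in all_stats) / n for d in range(dim))
        )
    return new_centers


def kmeans_step(partitions: Sequence[Sequence[Point]], centers: Sequence[Point]) -> List[Point]:
    """One iteration; only the small `Stats` tuples would cross the network."""
    stats = [local_stats(part, centers) for part in partitions]
    return merge_and_update(stats, centers)
```

In a real deployment the list comprehension in `kmeans_step` would be replaced by one `local_stats` call per machine followed by a reduction; the data points themselves never move.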


“…In this section, we will present a brief overview of some VQ techniques, both serial (LBG [29], Kmeans [31] and ELBG [36]) and parallel (PKM [38,39,1,9,43,11,34,33], PARELBG [34,37], P-CLUSTER [23][24][25] and PAUL [4]), selected from the large existing literature.…”

Section: Previous Work (mentioning)

confidence: 99%

“…Various hardware architectures have been employed such as, for example: specialized architectures [33], massively parallel processors [38], transputers [39,1] and networks of workstations [9,43,11,34]. The idea at the basis of such techniques is the subdivision of the most time-consuming part of the algorithm (the calculation of the Voronoi partition) into a certain number of subtasks to be executed in parallel, while the remaining operations (the calculation of the new centroids) are serially executed by a single process.…”

Section: Parallel K-means (Pkm): a Family Of Pisa Algorithms (mentioning)

confidence: 99%
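The division of labor described in the statement above (Voronoi partition computed in parallel, new centroids computed serially by one process) can be sketched as follows. This is an assumption-laden illustration, not the PKM implementations cited there: `pkm_iteration` and its helpers are hypothetical names, and a thread pool stands in for the processors or workstations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest_index(point: Point, codebook: Sequence[Point]) -> int:
    """Index of the codeword closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, codebook[k])),
    )


def label_chunk(chunk: Sequence[Point], codebook: Sequence[Point]) -> List[int]:
    """Parallel subtask: assign every point of one chunk to its Voronoi cell."""
    return [nearest_index(point, codebook) for point in chunk]


def pkm_iteration(
    data: Sequence[Point], codebook: Sequence[Point], workers: int = 2
) -> Tuple[List[int], List[Point]]:
    """One iteration: parallel labeling, then a serial centroid update."""
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # Parallel part: the Voronoi partition. `map` preserves chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(label_chunk, chunks, [codebook] * len(chunks))
        labels = [label for part in parts for label in part]

    # Serial part: a single process recomputes the centroids.
    dim = len(codebook[0])
    counts = [0] * len(codebook)
    sums = [[0.0] * dim for _ in codebook]
    for point, k in zip(data, labels):
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value

    new_codebook: List[Point] = []
    for k, old in enumerate(codebook):
        if counts[k] == 0:
            # Assumption of this sketch: an empty cell keeps its old codeword.
            new_codebook.append(tuple(old))
        else:
            new_codebook.append(tuple(s / counts[k] for s in sums[k]))
    return labels, new_codebook
```

Note that in this scheme the labels for every data point travel back to the single updating process, whereas a sufficient-statistics decomposition returns only per-cluster counts and sums.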

“…When the task is very complex, i.e., both large data sets and large codebooks are involved (for example, in image segmentation [21], image coding and speech coding [42,30]), the computing time required by a classical approach [29,36] may be prohibitive. A possible solution to this problem is the use of parallel and distributed computing systems [38,39,1,9,43,11,24]. Recent studies [34,37] have shown that, for an efficient parallel implementation of the most advanced (serial) VQ techniques, it is necessary to use computing systems whose hardware provides a large bandwidth for inter-process communications.…”

Section: Introduction (mentioning)

confidence: 99%


LBGS: a smart approach for very large data sets vector quantization

Campobello

Mantineo

Patanè

et al. 2005

Signal Processing: Image Communication


Distributed data mining patterns and services: an architecture and experiments

Cesario

Talia

2011

Concurrency and Computation


SUMMARY Distributed data mining implements techniques for analyzing data on distributed computing systems by exploiting data distribution and parallel algorithms. The grid is a computing infrastructure for implementing distributed high‐performance applications and solving complex problems, offering effective support to the implementation and use of data mining and knowledge discovery systems. The Web Services Resource Framework has become the standard for the implementation of grid services and applications, and it can be exploited for developing high‐level services for distributed data mining applications. This paper describes how distributed data mining patterns, such as collective learning, ensemble learning, and meta‐learning models, can be implemented as Web Services Resource Framework mining services by exploiting the grid infrastructure. The goal of this work was to design a distributed architectural model that can be exploited for different distributed mining patterns deployed as grid services for the analysis of dispersed data sources. In order to validate such an approach, we also presented the implementation of two clustering algorithms on the developed architecture. In particular, the distributed k‐means and distributed expectation maximization were exploited as pilot examples to show the suitability of the implemented service‐oriented framework. An extensive evaluation of its performance was provided. Copyright © 2011 John Wiley & Sons, Ltd.
