Parallel Density-Based Clustering of Complex Objects

Brecheisen, Stefan; Kriegel, Hans‐Peter; Pfeifle, Martin

doi:10.1007/11731139_22

Cited by 46 publications

(22 citation statements)

References 8 publications


“…The experiment shows the algorithm achieves near linear speedup. Brecheisen et al [10] present a parallel DBSCAN on a workstation, which is parallelized by a conservative approximation of complex distance functions, based on the concept of filter merge points. The final result is derived from a global cluster connectivity graph.…”

Section: Related Work (mentioning)

confidence: 99%

“…= Unclassified then (7) if . = Unclassified then (8) Create c-cluster ← ( ); (9) if expandcluster ( , , , , , ) then (10) Create c-cluster ← ( ++); (11) end if (12) end if (13) the cells according to the received points and given in each of the mappers, so the cell with the same in different mappers stands for the same area; thus we can only use the cell to locate the assigned range in overall data space. (2) The points in an inclusive cell must belong to a c-cluster, so the cell and c-cluster are enough to stand for classification of all points in the cell.…”

Section: Cludoop Framework (mentioning)

confidence: 99%


Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop

Zhao

Wang

et al. 2015

International Journal of Distributed Sensor Networks


Density-based clustering for big data is critical for many modern applications, ranging from Internet data processing to massive-scale moving object management. This paper proposes the Cludoop algorithm, an efficient distributed density-based clustering method for big data using Hadoop. First, we propose a serial clustering algorithm, CluC, which leverages cell partition optimization and c-clusters to find clusters quickly. CluC classifies the points using the relationships of connected cells around points instead of expensive complete neighbor queries, which significantly reduces the number of distance calculations. Second, we propose Cludoop, which can efficiently cluster very-large-scale data in parallel using the already existing data partition on a Map/Reduce platform. It employs the proposed serial clustering CluC as a plugged-in clustering step on parallel mappers, and transmits a cell description instead of the complete cell to reduce both network and I/O costs. Guided by the proposed cell-based principles, we also design a three-step Merging-Refinement-Merging framework to merge c-clusters on the overlay of the assigned preclustering results on the reducer. Finally, our comprehensive experimental evaluation on 10 network-connected commercial PCs, using both huge-volume real and synthetic data, demonstrates (1) the effectiveness of our algorithm in finding correct clusters of arbitrary shape and (2) that our proposed algorithm exhibits better scalability and efficiency than the state-of-the-art method.



“…The algorithm starts with an arbitrary point x ∈ X and retrieves its eps-neighborhood (Line 4). If the eps-neighborhood contains at least minpts points, the procedure yields a new cluster, C. The algorithm then retrieves all points in X that are density-reachable from x and adds them to the cluster C (Lines 8-17). If the eps-neighborhood of x has fewer than minpts points, then x is marked as noise (Line 6).…”

Section: Condition 23 (Noise) (mentioning)

confidence: 99%
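The procedure quoted above can be sketched in a few lines of Python. This is a minimal illustration of sequential DBSCAN, not code from any of the cited papers: the names `region_query` and `dbscan`, the brute-force neighborhood search, and the Euclidean distance are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+

UNCLASSIFIED, NOISE = None, -1

def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, minpts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNCLASSIFIED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNCLASSIFIED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts:
                seeds.extend(j_neighbors)  # j is also core: keep expanding
        cluster_id += 1
    return labels
```

The outer loop visits points in a fixed order and the expansion step depends on labels set earlier, which is the sequential data access pattern that the parallel variants discussed on this page try to break.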

“…The speedups are plotted in Figure 5. Finally, we have compared our parallel DBSCAN algorithm with the previous master-slave approaches [14], [15], [16], [17], [18], [19], [20]. As their source codes are not available, we have implemented their ideas, where the master process performs the cluster assignment while the slave processes answer the neighborhood queries [15], [17].…”

Section: B. Parallel DBSCAN on a Shared Memory Computer (mentioning)

confidence: 99%
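The master-slave scheme described in this statement, in which the master assigns clusters and the slaves answer neighborhood queries, might look roughly like the following. It is only a sketch under stated assumptions: worker threads stand in for slave processes, the data is split round-robin, and `make_partitions`, `partition_query`, and `parallel_region_query` are hypothetical names rather than functions from the cited implementations.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def make_partitions(points, n_workers):
    """Split (index, point) pairs round-robin into n_workers partitions."""
    indexed = list(enumerate(points))
    return [indexed[k::n_workers] for k in range(n_workers)]

def partition_query(partition, query_point, eps):
    """Worker side: indices in this partition within eps of query_point."""
    return [idx for idx, p in partition if dist(p, query_point) <= eps]

def parallel_region_query(pool, partitions, query_point, eps):
    """Master side: fan the query out to all workers and merge the answers."""
    futures = [pool.submit(partition_query, part, query_point, eps)
               for part in partitions]
    result = []
    for future in futures:
        result.extend(future.result())
    return sorted(result)

if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10)]
    partitions = make_partitions(points, n_workers=2)
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(parallel_region_query(pool, partitions, (0, 0), 1.5))
```

Only the neighborhood queries run in parallel here; the master still performs cluster assignment one point at a time.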

“…Many existing parallelizations adopt the master-slave model. For example, in [14], the data is equally partitioned and distributed among the slaves, each of which computes the clusters locally and sends back the results to the master in which the partially clustered results are merged sequentially to obtain the final result. This strategy incurs a high communication overhead between the master and slaves, and a low parallel efficiency during the merging process.…”

Section: Introduction (mentioning)

confidence: 99%


A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Patwary

Palsetia

Agrawal

et al. 2012

2012 International Conference for High Performance Computing, Networking, Storage and Analysis



DBSCAN is a well-known density-based clustering algorithm capable of discovering arbitrarily shaped clusters and eliminating noise data. However, parallelization of DBSCAN is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy, which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory. Using data sets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.
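To illustrate how a disjoint-set structure can replace DBSCAN's sequential cluster expansion, here is a minimal single-threaded sketch. It is not the PDSDBSCAN implementation: the function names, the brute-force neighbor search, and the convention that each cluster is labeled by the smallest point index it contains are assumptions made for this example.

```python
from math import dist

def find(parent, x):
    """Return the root of x's tree, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the trees of a and b; the smaller root index becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

def disjoint_set_dbscan(points, eps, minpts):
    """Label each point with its cluster's root index, or -1 for noise."""
    n = len(points)
    parent = list(range(n))       # every point starts as its own tree
    is_core = [False] * n
    is_member = [False] * n       # True once a point belongs to some cluster
    for i in range(n):
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < minpts:
            continue
        is_core[i] = is_member[i] = True
        for j in neighbors:
            if is_core[j]:
                union(parent, i, j)   # two core points: merge their clusters
            elif not is_member[j]:
                is_member[j] = True   # attach j to exactly one cluster
                union(parent, i, j)
    return [find(parent, i) if is_member[i] else -1 for i in range(n)]
```

Because each point only merges trees with its own neighbors, the order in which points are processed does not change which core points end up together, and that is what makes the formulation attractive for parallel execution. A real parallel version would additionally need thread-safe union operations, which this sketch omits.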


Data Parallel density‐based genetic clustering on CUDA Architecture

Krömer

Platoš

Snåšel

2013

Concurrency and Computation


Evolutionary clustering algorithms have proven to be good at finding clusters in data. Their advantages include the ability to adapt to data and to determine the number of clusters automatically, thus requiring fewer a priori assumptions about the analyzed objects than traditional clustering methods. Unfortunately, clustering by genetic algorithms, and by evolutionary algorithms in general, suffers from high computational costs due to recurrent fitness function evaluation. Computing on graphics processing units (GPUs) is a recent programming and development paradigm that brings high-performance parallel computing closer to a general audience. Modern general-purpose GPUs are composed of tens to thousands of computational cores that can execute programs in parallel using the single instruction, multiple data parallel processing approach. General-purpose GPU programs need to be designed and implemented in a data-parallel way, and with respect to the architecture of the target devices, to fully utilize their high performance. This study presents the design, implementation, and evaluation of a data-parallel genetic algorithm for density‐based clustering. The algorithm was implemented and evaluated on the nVidia Compute Unified Device Architecture (CUDA) platform. Copyright © 2013 John Wiley & Sons, Ltd.

