Hybrid Partitioning-Density Algorithm for K-Means Clustering of Distributed Data Utilizing OPTICS

Markiewicz, Mikołaj; Koperwas, Jakub

doi:10.4018/ijdwm.2019100101

Cited by 2 publications

(1 citation statement)

References 22 publications


“…K-means is also well supported in the MLlib machine learning library of Apache Spark [18]. For example, DBSCAN [19], OPTICS [20,21], and other algorithms perform clustering based on the dense density of datasets in spatial distribution, wherein the number of clusters need not be set in advance; thus, they are particularly suitable for clustering datasets with unknown content. In the context of big data, the optimization and innovation of these algorithms are still very important research prospects [18]…”

Section: Related Work

mentioning

confidence: 99%

DACA: Distributed adaptive grid decision graph based clustering algorithm

Zhou, Wang, et al. 2021

Softw Pract Exp


Clustering algorithms play a very important role in machine learning. With the development of big-data artificial intelligence, distributed parallel algorithms have become an important research field. To reduce the computational complexity and running time of clustering large-scale datasets, this study proposes a distributed clustering algorithm, DACA (distributed adaptive grid decision graph based clustering algorithm). In a distributed environment, DACA uses relative entropy to adaptively mesh the data into clearly separated sparse and dense grid cells. Then, a decision graph is used to determine the grid cells that act as cluster centers. Finally, a KD-tree is used to accelerate determining the cluster center of each sparse point, which completes the clustering. The algorithm is implemented on the popular Apache Spark computing framework. Compared with other distributed clustering algorithms, DACA can adaptively divide the grid according to the data distribution to obtain a better clustering effect. At the same time, the KD-tree is used to speed up the determination of cluster centers. Numerous experiments show that DACA has excellent performance and accuracy on six standard datasets and real GPS trajectory datasets.

KEYWORDS: adaptive grid division, clustering algorithms, decision graphs, distributed, KD-tree

INTRODUCTION

Recently, with the rapid development of the internet, the amount of data generated each year has been exponentially increasing. The global data volume could reach 1.8 ZB in 2 days (one zettabyte is 2^70 bytes) [1] in 2014. By 2021, the growth of data generation will become even higher. Thus, the conventional single application-driven internet services have to be converted into data- and application-driven internet services. Therefore, big-data artificial intelligence and machine learning have been developing rapidly, most visibly in research on big-data clustering algorithms, which are often discussed in the fields of internet business and academic research [2]. Clustering algorithms, an important class of data mining algorithms, have been widely employed for data analysis, image processing, market research, user segmentation, web document classification, and other fields. At present, clustering algorithms can be roughly divided into three types: traditional clustering, intelligent…


Evaluation Platform for DDM Algorithms With the Usage of Non-Uniform Data Distribution Strategies

Markiewicz, Koperwas, 2021

International Journal of Information Technologies and Systems Approach

Self Cite


Huge amounts of data are collected in numerous independent data storage facilities around the world. However, how the data is distributed between physical locations remains unspecified. Downloading all of the data for the purpose of processing it is undesirable and sometimes even impossible. Various methods have been proposed for performing data mining tasks, but the main problem is the lack of an objective strategy for comparing them. The authors present current research on a novel evaluation platform for distributed data mining (DDM) algorithms. The proposed platform makes it possible to evaluate algorithms not only in terms of the quality of the results, the data transfer used, and speed, but also under a non-uniform data distribution among independent nodes. This work introduces the term ‘data partitioning strategy’, referring to a specific, not necessarily uniform, data distribution. A brief evaluation of three clustering algorithms is also reported, showing how simply the platform can be used to identify differences in processing.
