2014
DOI: 10.1109/tpds.2014.2306193
Efficient k-Means++ Approximation with MapReduce

Abstract: k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers, and a proper initialization is crucial for obtaining an ideal solution. To overcome this problem, k-means++ sequentially chooses the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of the data increases. T…

Cited by 47 publications (4 citation statements). References 22 publications.
“…However, a MapReduce-based implementation of K-means, for example, needs multiple MapReduce jobs for the initialization. The MapReduce K-means++ method [7] tries to address this issue, as it uses one MapReduce job to select the K initial prototypes, which speeds up the initialization compared to K-means. Suggestions for parallelizing the second, search phase of K-means have been given in several papers (see, e.g., [8,9]).…”
Section: Introduction
confidence: 99%
“…The k-means and mini batch k-means methods share the same weakness, namely sensitivity to the chosen initial cluster centers [8]. The k-means++ method is used to address this weakness in k-means and mini batch k-means by choosing the first cluster center at random and then selecting subsequent centers based on the computed distance between each data point and the nearest already-chosen cluster center [9].…”
Section: Introduction
“…Nevertheless, the performance of k-means was improved by combining it with the k-means++ initialization algorithm, a subprocess that seeds the centroids [60]. To this end, k-means++ outperforms plain k-means both by completing the clustering task more quickly and by converging faster to a minimal intra-cluster variance [60,61]. The k-means++ algorithm operates as follows:…”
Section: The K-means/k-means++ Clustering Algorithm
confidence: 99%
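The D²-weighted seeding that the snippets above describe (pick the first center at random, then pick each subsequent center with probability proportional to its squared distance from the nearest chosen center) can be sketched as follows. This is an illustrative pure-Python sketch, not the paper's MapReduce implementation; the function name and signature are assumptions.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding sketch: the first center is chosen uniformly at
    random; each later center is sampled with probability proportional to
    D(x)^2, the squared distance from x to its nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # D(x)^2 for every point: squared distance to the closest center so far
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # sample the next center with probability D(x)^2 / sum(D(x)^2)
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

Because an already-chosen center has D(x)² = 0, it is effectively never re-sampled, which is what spreads the seeds apart and yields the fast convergence the snippet refers to.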