2021
DOI: 10.1007/s10994-021-06021-7
An empirical comparison between stochastic and deterministic centroid initialisation for K-means variations

Abstract: K-Means is one of the most widely used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application, it is well known to suffer from a series of disadvantages: it can only find local minima, and the positions of the initial cluster centres (centroids) can greatly affect the clustering solution. Over the years, many K-Means variations and initialisation techniques have been proposed, with different degrees of complexity. In this study we focus on common K-Me…
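The abstract's core claim — that initial centroid positions can change which local minimum K-Means converges to — can be illustrated with a short sketch. This uses scikit-learn as a stand-in implementation (an assumption; the paper's own code is not shown here): stochastic initialisation with different seeds can yield different solutions, while a deterministic initialisation (here, a naive explicit centroid array) is reproducible across runs.

```python
# Sketch (not from the paper): how initial centroids affect the K-Means
# solution, using scikit-learn as an assumed stand-in implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Stochastic initialisation: different seeds may reach different local minima,
# so the final within-cluster SSE (inertia) can vary across runs.
inertias = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=s).fit(X).inertia_
    for s in range(5)
]

# Deterministic initialisation: an explicit centroid array makes the run
# reproducible. Taking the first three points is a naive illustrative choice,
# not one of the initialisation methods studied in the paper.
init_centroids = X[:3]
det = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)

print(sorted(round(i, 1) for i in inertias))
print(round(det.inertia_, 1))
```

Refitting with the same explicit centroids yields the same inertia, which is the reproducibility property deterministic initialisation is meant to provide.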

Cited by 25 publications (16 citation statements). References 32 publications.
“…We applied K-Means clustering on these transformed data and the original dataset, varying the number of clusters from 1 to 10. As a clustering algorithm, we used Lloyd’s K-Means [37], initialised with the Density K-Means++ method [38], which worked well in our previous benchmark [39]. We used the R package clustree [40] to visualize a K-Means clustering tree (number of target clusters 1 to 10) for data transformations with different numbers of PCs (2 to 11, leading to different data dimensionality; see also Supplementary Material).…”
Section: Methods (mentioning)
confidence: 99%
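The procedure described in the statement above — running K-Means over a range of target cluster counts before building a clustering tree — can be sketched as follows. This is a minimal Python sketch assuming scikit-learn in place of the R tooling the citing paper actually used (Lloyd's K-Means with clustree), recording the within-cluster SSE for each k:

```python
# Sketch (assumption: scikit-learn instead of the R packages cited above):
# fit K-Means for each target cluster count k = 1..10 and record the
# within-cluster SSE, the quantity typically inspected before choosing k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

sse_by_k = {}
for k in range(1, 11):  # number of target clusters varied from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    sse_by_k[k] = km.inertia_  # SSE shrinks as k grows

print(sse_by_k)
```

A clustering tree then tracks how points move between the clusterings at consecutive values of k; the sketch stops at the per-k fits that such a tree is built from.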
“…SSE, Silhouette score, Purity and CPU time are used to measure the performance of our proposed method. While SSE, Purity and Silhouette score measure the quality of the clusters formed, CPU time measures efficiency (7, 13–15). These evaluation criteria are explained below:…”
Section: Evaluation Criteria (mentioning)
confidence: 99%
“…2. Purity: It is an external validity index that measures the degree of similarity between the clustering solution formed by a clustering method and that specified by the given class labels (7).…”
Section: Sum of Squared Error (SSE) (mentioning)
confidence: 99%