k-means is a popular clustering algorithm because of its simplicity and its scalability to large datasets. However, one of its drawbacks is the difficulty of identifying the correct value for the k-hyperparameter. Tuning this value correctly is critical for building effective k-means models. The traditional elbow method for identifying this value has a long-standing literature. However, on certain datasets the method produces smooth curves with no clearly visible elbow, making the k-value hard to identify. Various internal validation indexes, proposed as a solution to this issue, may in turn be inconsistent with one another. Although several techniques for handling the smooth-elbow problem exist, k-hyperparameter tuning in high-dimensional spaces remains intractable and an open research issue. In this paper, we first review the existing techniques for addressing smooth-elbow challenges. The identified research gaps are then used to develop a new technique, referred to as an ensemble-based technique of a self-adapting autoencoder and internal validation indexes, which is validated on high-dimensional space clustering. The optimal k-value, tuned by this technique using a voting scheme, is a trade-off between the number of clusters visualized in the autoencoder's latent space, the k-value from the ensemble internal validation index score, and the k-value that yields a derivative of 0, or close to 0, at the elbow. Experimental results based on Cochran's Q test, ANOVA, and McNemar's test indicate relatively good performance of the newly developed technique in k-hyperparameter tuning.
KEYWORDS: k-hyperparameter tuning; high-dimensional; smooth elbow
K-Means Architecture

K-means is an iterative algorithm that aims to partition a dataset into a set of k non-overlapping groups of data points [9]. The k-hyperparameter is one of the most important hyperparameters to tune in k-means [10,11], and tuning a machine learning model's hyperparameters has a significant effect on its performance [12]. In this subsection, we explore k-clusters, the k-hyperparameter, and the traditional elbow method used to identify its optimal value.
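As an illustrative sketch of the traditional elbow method discussed here (variable names and the elbow heuristic below are our own, not from the paper), one can fit k-means over a range of candidate k-values, record the inertia (within-cluster sum of squared distances), and look for the point where the curve's decrease flattens:

```python
# Illustrative elbow-method sketch using scikit-learn (our own example setup).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters, so the elbow is unambiguous.
X, _ = make_blobs(n_samples=200,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=0)

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# Simple elbow heuristic: the k after which the drop in inertia shrinks most.
drops = np.diff(inertias)           # negative; large magnitude = big improvement
rel_gain = drops[:-1] / drops[1:]   # ratio of successive drops
elbow_k = int(np.argmax(rel_gain)) + 2  # +2: diff shifts the index; ks start at 1
print(elbow_k)
```

On a smooth curve, however, `rel_gain` has no dominant peak, which is exactly the failure mode motivating this paper.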
K-Means Clusters

The k-means clusters are the data sub-groups generated by the popular unsupervised partitioning algorithm known as the k-means clustering algorithm [13]. Cluster analysis with k-means is an example of a k-means-based model that has been applied successfully in many domains [14,15]. The k-clusters generated by this algorithm are distinct, non-overlapping groups of data points, aggregated together because they share specific similarities [16]. The data points within a particular cluster are similar, while data points across different clusters are dissimilar [17]. Both the intra-cluster and inter-cluster distances are measured using a sum-of-squared-distances metric [18,19]. For this reason, the original un-partitioned dataset is stan...
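The sum-of-squared-distances metric above can be sketched in a few lines of NumPy (the `wcss` helper and the toy data are our own illustrative names, not from the paper): it sums, over every cluster, the squared distance of each member point to that cluster's centroid, which is the quantity k-means minimizes.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squared distances (the k-means objective)."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

# Tiny example: two obvious clusters on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])
print(wcss(X, labels, centroids))  # each point is 0.5 from its centroid: 4 * 0.25 = 1.0
```

A low intra-cluster sum paired with a high inter-cluster separation is the pattern the validation indexes discussed later try to quantify.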