Rethinking k-means clustering in the age of massive datasets: a constant-time approach

Olukanmi, Peter O.; Nelwamondo, Fulufhelo V.; Marwala, Tshilidzi

doi:10.1007/s00521-019-04673-0

Cited by 17 publications

(12 citation statements)

References 48 publications

“…Hence, k-means clustering is employed to test the proposed method in this paper. Recently, several techniques have been proposed to improve the standard k-means algorithm for high dimensional datasets, such as the Entropy Regularized Power k-Means [4], sparse k-means [41] and others [24]. The proposed k-Fold CV for unsupervised learning can also be applied to these modified versions of the k-means algorithm.…”

Section: Proposed Methods (mentioning)

confidence: 99%

Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets

2020

Clustering validation is one of the most important and challenging parts of clustering analysis, as there is no ground truth knowledge to compare the results with. Up till now, the evaluation methods for clustering algorithms have been used for determining the optimal number of clusters in the data, assessing the quality of clustering results through various validity criteria, comparison of results with other clustering schemes, etc. It is also often practically important to build a model on a large amount of training data and then apply the model repeatedly to smaller amounts of new data. This is similar to assigning new data points to existing clusters which are constructed on the training set. However, very little practical guidance is available to measure the prediction strength of the constructed model to predict cluster labels for new samples. In this study, we proposed an extension of the cross-validation procedure to evaluate the quality of the clustering model in predicting cluster membership for new data points. The performance score was measured in terms of the root mean squared error based on the information from multiple labels of the training and testing samples. The principal component analysis (PCA) followed by k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested using three benchmark multi-label datasets and has shown promising results with overall RMSE of less than 0.075 and MAPE of less than 12.5% in three datasets.
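The idea in the abstract above, fitting a clustering model on training folds and then assigning held-out points to the existing clusters, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain Lloyd's k-means in pure Python, omits the PCA step, and scores each fold by the root mean squared distance of held-out points to their assigned centroid. That score is a stand-in chosen for this sketch; the paper's RMSE is based on multi-label information that is not specified here. The function names `kmeans`, `nearest`, and `kfold_cluster_cv` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct positions
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def kfold_cluster_cv(points, k, n_folds=5, seed=0):
    """Fit k-means on each training split and score the held-out fold.

    The score (RMS distance to the assigned centroid) is an assumption
    made for this sketch, not the label-based RMSE of the cited paper.
    """
    rng = random.Random(seed)
    idx = list(range(len(points)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for held in folds:
        held_set = set(held)
        train = [points[i] for i in idx if i not in held_set]
        centroids = kmeans(train, k, seed=seed)
        # Assign each held-out point to its nearest existing centroid.
        sq = [
            sq_dist(points[i], centroids[nearest(points[i], centroids)])
            for i in held
        ]
        scores.append((sum(sq) / len(sq)) ** 0.5)
    return scores
```

Calling `kfold_cluster_cv(points, k=2)` on a list of coordinate tuples returns one score per fold; lower values mean the held-out points sit closer to the clusters built from the training folds.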

“…In the evaluations, we select 30 testing scenarios uniformly at random for each of the three SPLs so that we can use parametric statistical hypothesis tests to determine whether or not there is a significant difference between inc and sm approaches [27], [28]. Each testing scenario has an EP and a PUC; PUCs are used to collect data on the approaches.…”

Section: SPLs Under Consideration and Testing Scenarios (mentioning)

confidence: 99%

Incremental Testing in Software Product Lines—An Event Based Approach

Beyazit, Tuğlular, Kaya

2023

IEEE Access

One way of developing fast, effective, and high-quality software products is to reuse previously developed software components and products. In the case of a product family, the software product line (SPL) approach can make reuse more effective. The goal of SPLs is faster development of low-cost and high-quality software products. This paper proposes an incremental model-based approach to test products in SPLs. The proposed approach utilizes event-based behavioral models of the SPL features. It reuses existing event-based feature models and event-based product models along with their test cases to generate test cases for each new product developed by adding a new feature to an existing product. Newly introduced featured event sequence graphs (FESGs) are used for behavioral feature and product modeling; thus, generated test cases are event sequences. The paper presents evaluations with three software product lines to validate the approach and analyze its characteristics by comparing it to the state-of-the-art ESG-based testing approach. Results show that the proposed incremental testing approach highly reuses the existing test sets as intended. Also, it is superior to the state-of-the-art approach in terms of fault detection effectiveness and test generation effort but inferior in terms of test set size and test execution effort.

Index terms: incremental testing, model-based testing, software product line.

This article has been accepted for publication in IEEE Access.

“…The output of segmentation is a set of k non-overlapping segments {S_1, S_2, …, S_k} that comprises the whole segmented representation of a dataset X in the form of [15]:…”

Section: The Color Image Segmentation Algorithm (mentioning)

confidence: 99%

SFE2D: A Hybrid Tool for Spatial and Spectral Feature Extraction

Abbassi, Cheng

2022

Mining Technology

A crucial task for integrated geoscientific image (geo-image) interpretation is the relevant geological representation of multiple geo-images, which demands high-dimensional techniques for extracting latent geological features from high-dimensional geo-images. A standalone mathematical tool called SFE2D (spatiospectral feature extraction in two-dimension) is developed based on independent component analysis (ICA), continuous wavelet transform (CWT), k-means clustering segmentation, and RGB color processing that iteratively separates, extracts, clusters, and visualizes the highly correlated and overlapped geological features from multiple sources of geo-images. The SFE2D offers spatial feature extraction and wavelet-based spectral feature extraction for further extraction of frequency-dependent features. We show that the SFE2D is a robust tool for automated pattern recognition, fast pseudo-geological mapping, and detection of regions of interest with a wide range of applications in different scales, from regional geophysical surveys to the interpretation of microscopic images.
