In this paper, the K-means (KM) and Fuzzy C-means (FCM) algorithms were compared for computing performance and clustering accuracy on cluster structures of different shapes, scattered both regularly and irregularly in two-dimensional space. While the accuracy of single-pass KM was lower than that of the FCM, KM with multiple starts achieved nearly the same clustering accuracy as the FCM. Moreover, KM with multiple starts was far superior to the FCM in computing time on all datasets analyzed. Therefore, when well-separated cluster structures with regular spreading patterns exist in a dataset, KM with multiple starts is recommended for cluster analysis because of its comparable accuracy and better runtime performance.
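The multiple-starts strategy described above can be sketched as follows. This is a minimal NumPy illustration of plain Lloyd's K-means that keeps the lowest-inertia run over several random initializations; the synthetic two-cluster data, the number of starts, and all function names are assumptions for the demo, not the paper's actual experimental setup.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """One run of Lloyd's K-means; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(rng)
    # initialize centroids from k distinct random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

def kmeans_multi_start(X, k, starts=10, seed=0):
    """K-means with multiple random starts: keep the lowest-inertia run."""
    best = None
    for s in range(starts):
        result = kmeans(X, k, rng=seed + s)
        if best is None or result[2] < best[2]:
            best = result
    return best

# two well-separated 2-D clusters (hypothetical demo data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels, inertia = kmeans_multi_start(X, k=2)
```

Restarting from several random initializations and keeping the lowest-inertia solution is what makes single-machine K-means competitive in accuracy while remaining fast, which matches the abstract's recommendation.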
Discretization is a data pre-processing task that transforms continuous variables into discrete ones so that data mining algorithms such as association rule extraction and classification trees can be applied. In this study, we empirically compared the performance of equal-width intervals (EWI), equal-frequency intervals (EFI), and K-means clustering (KMC) for discretizing 14 continuous variables in a chicken egg quality traits dataset. We found that these unsupervised discretization methods can decrease the training error rates and increase the test accuracies of classification tree models. Comparing the training errors and test accuracies of models built with the C5.0 classification tree algorithm, we also found that EWI, EFI, and KMC produced more or less similar results. Among the rules for estimating the number of intervals, the Rice rule gave the best result with EWI but not with EFI. The Freedman-Diaconis rule with EFI, and the Doane rule with both EFI and EWI, performed slightly better than the other rules.
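The discretization methods and one of the interval-count rules compared above can be sketched as follows. This is a minimal NumPy illustration, not the study's implementation: the Rice rule (k = ceil(2 * n^(1/3))) sets the number of intervals, EWI cuts the variable's range into equal-width bins, and EFI cuts it at quantiles. The simulated "egg weight" variable and all function names are hypothetical.

```python
import numpy as np

def rice_k(n):
    """Rice rule for the number of intervals: k = ceil(2 * n^(1/3))."""
    return int(np.ceil(2 * n ** (1 / 3)))

def ewi(x, k):
    """Equal-width intervals: k bins of identical width over the range of x."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    # digitize against the k-1 inner edges yields interval indices 0..k-1
    return np.digitize(x, edges[1:-1])

def efi(x, k):
    """Equal-frequency intervals: bin edges at the k-quantiles of x."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.digitize(x, edges[1:-1])

# hypothetical continuous trait, e.g. egg weight in grams
rng = np.random.default_rng(1)
x = rng.normal(60, 5, 200)
k = rice_k(len(x))   # 200 observations -> k = ceil(2 * 200^(1/3)) = 12
w = ewi(x, k)        # equal-width codes
f = efi(x, k)        # equal-frequency codes
```

The design difference is visible in the bin counts: EWI bins vary in frequency with the shape of the distribution, while EFI bins hold roughly n/k observations each, which is why the two methods can favor different interval-count rules.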
In data mining, cluster analysis is one of the most widely used techniques for discovering groups in datasets. However, traditional clustering algorithms have become insufficient for analyzing the big data produced by the enormous growth in collected data in recent years. Scalability has therefore become one of the most intensively studied research topics in clustering big data. Parallel clustering algorithms and Map-Reduce-based techniques running on multiple machines are popular approaches to scalable big data analysis. However, applying sampling techniques to big datasets can still be an alternative or complementary approach that allows the traditional algorithms to run on a single machine. The results obtained in this study showed that data size reduction by simple random sampling can be used successfully in cluster analysis of large datasets. The clustering validities obtained by running the K-means algorithm on the sample datasets were as high as those for the complete datasets. Additionally, the execution time required for cluster analysis on the sample datasets was significantly shorter than that for the complete datasets.
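A minimal sketch of the sample-then-cluster idea, assuming a NumPy-only setting: cluster a simple random sample with Lloyd's K-means (farthest-point initialization here, an assumption chosen for stability, not the paper's method), then label the complete dataset with one nearest-centroid pass. The sampling fraction, seeds, and data are illustrative.

```python
import numpy as np

def sample_then_cluster(X, k, frac=0.1, seed=0, n_iter=50):
    """Cluster a simple random sample of X, then assign every point in X
    to the sample-fitted centroids ('frac' is an assumed sampling rate)."""
    rng = np.random.default_rng(seed)
    # simple random sample without replacement
    idx = rng.choice(len(X), size=max(k, int(frac * len(X))), replace=False)
    S = X[idx]
    # farthest-point initialization over the sample
    centroids = [S[rng.integers(len(S))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(S - c, axis=1) for c in centroids], axis=0)
        centroids.append(S[d.argmax()])
    centroids = np.array(centroids)
    # plain Lloyd iterations on the sample only
    for _ in range(n_iter):
        labels = np.linalg.norm(S[:, None] - centroids, axis=2).argmin(axis=1)
        centroids = np.array([S[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    # one cheap pass over the complete dataset to label every point
    full_labels = np.linalg.norm(X[:, None] - centroids, axis=2).argmin(axis=1)
    return centroids, full_labels

# two well-separated synthetic clusters standing in for a large dataset
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (500, 2)), rng.normal(6, 0.5, (500, 2))])
centroids, labels = sample_then_cluster(X, k=2)
```

The expensive iterative work runs only on the sample; the full dataset is touched once for assignment, which is the source of the runtime savings the abstract reports.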