Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Dinh, Duy-Tai; Fujinami, Tsutomu; Huynh, Van-Nam

doi:10.1007/978-981-15-1209-4_1

Cited by 100 publications

(64 citation statements)

References 14 publications


“…The number of clusters was chosen using the silhouette method. It enables finding the optimal number of clusters and interpreting and validating the consistency within the clusters of data [44][45][46]. The silhouette method combines two clustering criteria, namely, compactness and separation.…”

Section: Methods (mentioning)

confidence: 99%

Assessment of the Dependence of GHG Emissions on the Support and Taxes in the EU Countries

2021


The reduction of GHG emissions is one of the priorities of the EU countries. The majority of studies show that financial support and environmental taxes are among the most effective measures for mitigating the negative consequences of climate change. The EU countries employ different environmental support measures and environmental taxes to reduce GHG emissions. There is a shortage of new studies on these measures. The aim of the present study is to compare the effectiveness of the environmental support measures of the EU countries with the effectiveness of environmental taxes in relation to the reduction of GHG emissions. This study is characterized by the broad scope of its data analysis and its systematic approach to the EU’s environmental policy measures. An empirical study was performed for the EU countries with the aim of addressing this research problem and substantiating theoretical insights. A total of 27 EU member states from 2009 to 2018 were selected as research samples. The research is based on a cause-and-effect relationship, where the factors affecting environmental pollution (environmental taxes and subsidies) are the cause, and GHG emissions are the effect. Statistical research methods were used in the empirical study: descriptive statistics, the Shapiro–Wilk test, one-way analysis of variance (ANOVA), simple regression and cluster analysis. The results show that the older member countries of the EU, which had directed the financial measures of environmental policy towards a reduction in energy consumption, managed to achieve a greater reduction in GHG emissions compared to the countries which had not applied those measures. The Central and Eastern European countries are characterized by lower environmental taxes and lower expenditure allocated to environmental protection. The countries with a higher GDP per capita have greater GHG emissions than the countries with lower GDP per capita. This is associated with greater consumption, waste, and energy consumption. The study gives rise to a discussion regarding data sufficiency in the assessment and forecasting of GHG emissions and their environmental consequences.
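The citation statement above describes the silhouette method as combining compactness and separation. A minimal sketch of that idea for categorical data follows. It assumes the simple matching (Hamming) dissimilarity, which is an illustrative choice and not necessarily the measure used in the cited paper; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Simple matching dissimilarity: count of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def silhouette_score(points, labels, dist=hamming):
    """Mean silhouette over all points; values near 1 indicate compact, well-separated clusters."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention s(i) = 0 for a singleton cluster
        # a: mean distance to the other members of the same cluster (compactness)
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to any other cluster (separation)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Toy categorical data (hypothetical): two obvious groups.
points = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = [0, 0, 1, 1]   # matches the natural grouping
bad = [0, 1, 0, 1]    # mixes the two groups

print(silhouette_score(points, good))  # 1.0
print(silhouette_score(points, bad))   # -0.5
```

To estimate the number of clusters, one would run a clustering algorithm for several candidate values of k and keep the k whose labeling gives the highest mean silhouette.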


“…If the number of running states of rolling bearing contained in the dataset is known, the number of clusters is determined by the number of running states of rolling bearing. If the number of running states of rolling bearing contained in the dataset is unknown, the number of clusters can be dynamically determined by elbow method [36] or silhouette coefficient method [37]. The pheromone heuristic factor α indicates the relative importance of pheromone intensity, if the value of α is too large, the random search ability of the algorithm is easily weakened.…”

Section: Experiments a Experimental Setup (mentioning)

confidence: 99%

A Novel Bearing Fault Diagnosis Method Using Spark-Based Parallel ACO-K-Means Clustering Algorithm

Wan, Zhang, et al. 2021

IEEE Access


The K-Means clustering algorithm is a typical unsupervised learning method, and it has been widely used in the field of fault diagnosis. However, the traditional serial K-Means clustering algorithm struggles to perform clustering analysis efficiently and accurately on the massive running-state monitoring data of rolling bearings. Therefore, a novel fault diagnosis method for rolling bearings using a Spark-based parallel ant colony optimization (ACO)-K-Means clustering algorithm is proposed. Firstly, a Spark-based three-layer wavelet packet decomposition approach is developed to efficiently preprocess the running-state monitoring data to obtain eigenvectors, which are stored in the Hadoop Distributed File System (HDFS) and serve as the input of the ACO-K-Means clustering algorithm. Secondly, an ACO-K-Means clustering algorithm suitable for rolling bearing fault diagnosis is proposed to improve the diagnosis accuracy. The ACO algorithm is adopted to obtain the global optimal initial clustering centers of K-Means from all eigenvectors, and the K-Means clustering algorithm based on weighted Euclidean distance is used to perform clustering analysis on all eigenvectors to obtain a rolling bearing fault diagnosis model. Thirdly, the efficient parallelization of the ACO-K-Means clustering algorithm is implemented on a Spark platform, which can make full use of the computing resources of a cluster to efficiently process large-scale rolling bearing datasets in parallel. Extensive experiments are conducted to verify the effectiveness of the proposed fault diagnosis method. Experimental results show that the proposed method can not only achieve good fault diagnosis accuracy but also provide high model training efficiency and fault diagnosis efficiency in a big data environment.


“…Recent methods for categorical data consider the cluster centers as the expectation of a random variable associated with the data, in the assumption that this variable follows a Gaussian distribution from the statistical point of view [13,14,17,18,46,47,48]. The goal is to find a method that can guarantee the consistency in the statistical interpretation of the cluster centers for categorical data as the mean for numerical data.…”

Section: Example (mentioning)

confidence: 99%

“…Finding the solution for the above two challenges in categorical data clustering is not an easy task. Many clustering algorithms for categorical data have been designed to remove the limitation, while keeping the advantages of k-means [10,14,17,18,28,31,34,46,47,48,51]. In general, they have the same scheme as k-means, except that they use different ways to define cluster centers (cluster representatives) and distance measures for categorical data.…”

Section: Introduction (mentioning)

confidence: 99%

k-PbC: an improved cluster center initialization for categorical data clustering

Dinh, Huynh 2020

Appl Intell

Self Cite


The performance of a partitional clustering algorithm is influenced by the initial random choice of cluster centers. Different runs of the clustering algorithm on the same data set often yield different results. This paper addresses that challenge by proposing an algorithm named k-PbC, which takes advantage of non-random initialization from the view of pattern mining to improve clustering quality. Specifically, k-PbC first performs a maximal frequent itemset mining approach to find a set of initial clusters. It then uses a kernel-based method to form cluster centers and an information-theoretic based dissimilarity measure to estimate the distance between cluster centers and data objects. An extensive experimental study was performed on various real categorical data sets to draw a comparison between k-PbC and state-of-the-art categorical clustering algorithms in terms of clustering quality. Comparative results have revealed that the proposed initialization method can enhance clustering results and k-PbC outperforms compared algorithms for both internal and external validation metrics.
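The citation statement above notes that categorical clustering algorithms keep the k-means scheme but change how cluster centers and distances are defined. The sketch below illustrates that scheme in the style of k-modes, using per-attribute modes as centers and the simple matching dissimilarity. It is not k-PbC: the kernel-based centers, the information-theoretic measure, and the pattern-mining initialization described in the abstract are not reproduced here, and the function names and toy data are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: count of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_center(rows):
    """Cluster representative: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_centers, max_iter=20):
    """Alternate assignment and center update, as in k-means, until labels stop changing."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: matching_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each center as the mode of its members.
        for c in range(len(centers)):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mode_center(members)
    return labels, centers


# Toy categorical data (hypothetical): colour and size.
rows = [
    ("red", "s"), ("red", "s"), ("red", "m"),
    ("blue", "l"), ("blue", "l"), ("green", "l"),
]
labels, centers = k_modes(rows, initial_centers=[rows[0], rows[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [('red', 's'), ('blue', 'l')]
```

The initial centers are passed in explicitly because, as the abstract points out, the result of a partitional algorithm depends on how the centers are initialized.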
