Clustering is one of the most popular methods in data mining. Many algorithms can cluster data with either numeric or categorical attributes. However, most real-world data contain both numeric and categorical attributes, so a clustering method that handles mixed attribute types becomes important. The k-prototypes algorithm is one of the algorithms that can cluster data with mixed attribute types. However, it has a drawback in its dissimilarity measure for categorical data, so selecting a proper dissimilarity measure for categorical data is important for improving its performance. This paper compares distance and dissimilarity measures for clustering data with mixed attribute types. We used the k-prototypes algorithm on UCI datasets, i.e., Echocardiogram, Hepatitis, and Zoo, to assign cluster membership of the objects. The Silhouette index was employed to evaluate the clustering results. The results show that Euclidean distance and Ratio on Mismatches dissimilarity are the best combination for clustering data with numeric and categorical attribute types, as shown by an average Silhouette index approaching 1. We therefore propose using Euclidean distance and Ratio on Mismatches dissimilarity in the k-prototypes algorithm to cluster data with mixed attribute types.
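To make the combination named in the abstract concrete, the following is a minimal sketch of a k-prototypes-style dissimilarity that adds a Euclidean distance over the numeric attributes to a Ratio on Mismatches term over the categorical attributes. It is not the paper's implementation: the function names, the weight `gamma`, and the reading of Ratio on Mismatches as "mismatched categorical attributes divided by the number of categorical attributes" are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_on_mismatches(a, b):
    """Share of categorical attributes on which a and b differ (assumed definition)."""
    if len(a) != len(b):
        raise ValueError("categorical vectors must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Numeric part plus gamma times the categorical part.

    gamma is a hypothetical weight balancing the two parts; the abstract
    does not state which value the authors used.
    """
    return euclidean(num_a, num_b) + gamma * ratio_on_mismatches(cat_a, cat_b)


def assign_to_prototype(num, cat, prototypes, gamma=1.0):
    """Index of the closest prototype; prototypes is a list of (numeric, categorical) pairs."""
    costs = [
        mixed_dissimilarity(num, cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(costs)), key=costs.__getitem__)
```

For example, `assign_to_prototype([0.1, 0.0], ["a", "b"], [([0, 0], ["a", "b"]), ([3, 4], ["x", "y"])])` selects the first prototype, since both the numeric and the categorical parts are smaller for it. In practice the numeric attributes would usually be scaled first so that neither part dominates the sum.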
Clustering algorithms can group regions by economic potential using mixed-attribute data, consisting of numeric and categorical data. This study aims to group villages according to their economic potential in order to determine village development targets in Demak Regency, using the fuzzy k-prototypes algorithm with the modified Eskin distance to measure the distance between categorical attributes. The data used are PODES2018 data and the 2019 Wilkerstat Mapping. Village clustering produces three village clusters according to their economic potential, namely low, medium, and high economic clusters. Clusters of high economic potential are located on the main transportation routes of Semarang–Kudus and Semarang–Grobogan. However, villages on the main transportation routes are still found in the low economic cluster. In terms of the urban/rural village classification, most of these villages fall into the urban village category. The results of this clustering can be used to determine village development targets for increasing the Village Developing Index in Demak Regency.
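What sets fuzzy clustering apart from hard assignment is that each village receives a degree of membership in every cluster. The sketch below shows the standard fuzzy c-means style membership update as it is commonly adapted to fuzzy k-prototypes. It is an illustration under assumptions, not the study's code: the function name, the fuzzifier `m`, and the exponent `1 / (m - 1)` (which treats the input as an already squared-type dissimilarity) are not taken from the abstract, and the modified Eskin distance itself is not reproduced here.

```python
def fuzzy_memberships(dissimilarities, m=2.0):
    """Membership degrees of one object in each cluster.

    dissimilarities: non-negative dissimilarity of the object to each
    cluster prototype (computed by any mixed-attribute measure).
    m: fuzzifier, must be greater than 1 (hypothetical default of 2).
    """
    if m <= 1:
        raise ValueError("fuzzifier m must be greater than 1")

    # An object that coincides with one or more prototypes belongs
    # entirely to them; this also avoids division by zero below.
    zero = [i for i, d in enumerate(dissimilarities) if d == 0]
    if zero:
        share = 1.0 / len(zero)
        return [share if i in zero else 0.0 for i in range(len(dissimilarities))]

    power = 1.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_l) ** power for d_l in dissimilarities)
        for d_j in dissimilarities
    ]
```

The memberships always sum to 1, and an object can be given a hard label such as low, medium, or high by taking the cluster with the largest membership.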