Clustering is a technique for analyzing data efficiently and extracting the required information. A common way to cluster a dataset is k-means, which is based on selecting central points and calculating Euclidean distance. In k-means, the dataset is loaded and central points are selected from it; each point is then assigned to a cluster on the basis of its Euclidean distance to those central points. The main disadvantage of k-means is accuracy, because the user must define the number of clusters in advance. Because the number of clusters is user-defined, some points of the dataset remain un-clustered. In this work, an improvement to the k-means clustering algorithm is proposed that defines the number of clusters automatically and assigns the required cluster to un-clustered points. The proposed improvement is expected to improve accuracy and reduce clustering time when the members assigned to each cluster are used to predict cancer.
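The assignment-by-Euclidean-distance step described above can be sketched as follows. This is a minimal illustration of standard k-means with a user-supplied set of starting central points, not the proposed improvement; the function names (`euclidean`, `assign_points`, `update_centroids`, `kmeans`) and the `max_iter` parameter are assumptions made for this example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous centroid (a simplification).
            updated.append(old)
        else:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return updated


def kmeans(points, centroids, max_iter=100):
    """Standard k-means: the number of clusters is len(centroids), chosen by the user."""
    centroids = [tuple(c) for c in centroids]
    labels = assign_points(points, centroids)
    for _ in range(max_iter):
        updated = update_centroids(points, labels, centroids)
        if updated == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = updated
        labels = assign_points(points, centroids)
    return labels, centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Note that the caller fixes the number of clusters through the length of `centroids`, which is the limitation the abstract sets out to remove.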
Data clustering organizes objects that share the same characteristics into groups, called clusters. It is an unsupervised learning technique for classifying data. K-means is a widely used algorithm for cluster analysis. It divides n data points into k clusters based on a similarity measurement criterion. K-means is fast, which is why it is so commonly used. Its applications include vector quantization, cluster analysis, and feature learning. However, the results it generates depend mainly on the choice of initial cluster centroids. The main shortcoming of the algorithm is that the appropriate number of clusters must be supplied. Providing the number of clusters before applying the algorithm is highly impractical and requires deep knowledge of the clustering field. In this project, we propose an algorithm that improves centroid initialization for K-means. We work with n-dimensional numerical datasets as well as categorical datasets. For similarity measurement, we consider the Manhattan distance, Dice distance, and cosine distance. The results of the proposed algorithm will be compared with the original K-means, and its quality and complexity will be checked against the existing algorithm.
Index Terms: Data Clustering, K-Means, unsupervised learning, centroid.
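The three similarity measures named above can be sketched with their textbook definitions. The abstract does not say how each measure is applied, so this example assumes that Manhattan and cosine distance operate on numerical vectors and that Dice distance operates on categorical records treated as sets of attribute values. The function names are illustrative.

```python
import math


def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two numerical vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    """One minus the cosine of the angle between two numerical vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_p * norm_q)


def dice_distance(a, b):
    """One minus the Dice coefficient of two sets of categorical values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty records are treated as identical (assumption)
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))


d_manhattan = manhattan_distance((1, 2), (4, 6))
d_cosine = cosine_distance((1, 0), (0, 1))
d_dice = dice_distance({"red", "small"}, {"red", "large"})
```

Any of these functions could replace the Euclidean distance in the assignment step of k-means, although the mean-based centroid update is only directly meaningful for numerical data.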