Issues of K Means Clustering While Migrating to Map Reduce Paradigm with Big Data: A Survey

Nirmal, Khyati R.; Satyanarayana, K.

doi:10.11591/ijece.v6i6.11207

Cited by 8 publications (4 citation statements)

References 9 publications (9 reference statements)

“…Variable selection using feature selection. The result of selecting variables using the Jupyter Notebook tool is shown in Figure 4.1. The optimal number of k in this study was found with the Elbow method, because k-Means has a weakness in that the initial number of clusters is determined randomly [8]. The best number of k for clusters 1 to 10 using the Elbow method is k=2.…”

Section: Data Preprocessing (mentioning)

confidence: 99%

“…The Fuzzy C-Means algorithm has a faster process time and is easier to interpret [6]; however, it has weaknesses in the calculation process, and its fuzzy iterations take longer than the K-Means algorithm [7]. The K-Means algorithm is widely applied in research because it is more efficient at categorizing very large amounts of data, but it is weak in its random selection of centroid starting points and in determining the initial number of clusters [8].…”

Section: Introduction (mentioning)

confidence: 99%

Clustering Villages Based on Distance and Accessibility to Health Facilities Using the K-Means Method

Noviandi

Noviantika

Irawan

2022

JTOS

Melawi regency has 47 very underdeveloped and 63 underdeveloped villages. More than 50% of the villages have no health facilities, and only 20.53% of the road length in Melawi County is in good condition. One of the most important factors influencing health problems is the physical aspect, such as the availability of health facilities. In addition, the distance and ease of access to health facilities influence how quickly people are treated and vaccinated during the COVID-19 pandemic. The objective of this study is to determine the degree of accessibility of health facilities in villages by forming village clusters, which are likely to be important to the government in ensuring treatment and distribution of the COVID-19 vaccine. The clustering method used is K-Means with Euclidean distance to measure the distance between data points, the Elbow method to determine the optimal number of clusters, and the Silhouette coefficient to evaluate the model created with K-Means. The Elbow method showed that the optimal number of clusters is 2. Based on the results of the K-Means algorithm, the cluster with a larger average distance and access rated as difficult is cluster 1 with 92 villages, and cluster 1 has a smaller average distance and relatively easy access with 77 villages. The Silhouette coefficient of the result is 0.299.
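
The workflow this abstract describes (K-Means with Euclidean distance, the Elbow method over a range of k, then a Silhouette score) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the toy `points` data, the function names, and the evenly spaced (non-random) choice of starting centroids are all assumptions made here so that the run is reproducible.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-Means.

    Assumption for this sketch: starting centroids are taken at evenly
    spaced positions in the input list instead of being drawn at random.
    """
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squares, the quantity plotted in an elbow chart."""
    return sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    ids = sorted(set(labels))
    if len(ids) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in ids:
            dists = [euclidean(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        a = mean_dist.get(labels[i])
        others = [d for c, d in mean_dist.items() if c != labels[i]]
        if a is None or not others:
            scores.append(0.0)
            continue
        b = min(others)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated 3x3 grids of points.
blob = [(0.5 * x, 0.5 * y) for x in range(3) for y in range(3)]
points = blob + [(x + 10.0, y + 10.0) for x, y in blob]

# Elbow method: record WCSS for each candidate k and look for the bend.
inertia = {}
for k in range(1, 5):
    centroids, labels = kmeans(points, k)
    inertia[k] = wcss(points, centroids, labels)

_, labels_k2 = kmeans(points, 2)
score_k2 = silhouette(points, labels_k2)
print(inertia, score_k2)
```

On this toy data the WCSS drops sharply from k=1 to k=2 and flattens afterwards, which is the "elbow" the method looks for. Real data rarely gives such a clean bend, which is one reason a Silhouette score is often reported alongside it.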

“…Mapreduce [12] can process data in parallel by the use of map and reduce phase. Kmeans is deployed on Mapreduce with parallel calculation of clusters for processing large scale of data [4], [14], [15]. Similarity between data objects and clusters are different for every object.…”

Section: Proposed Technique (mentioning)

confidence: 99%

A Novel Approach for Clustering Big Data based on MapReduce

Bathla

Aggarwal

Rani

2018

IJECE

Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications, such as information retrieval, image processing, and social network analytics. It helps the user understand the similarity and dissimilarity between objects, and cluster analysis makes complex and large data sets easier to understand. Various researchers have analyzed different types of clustering algorithms. Kmeans is the most popular partitioning-based algorithm because it provides good results through accurate calculation on numerical data, but it gives good results for numerical data only. Big data is a combination of numerical and categorical data. The Kprototype algorithm deals with numerical as well as categorical data by combining the distances calculated from the numeric and categorical attributes. With the growth of data due to social networking websites, business transactions, scientific calculation, etc., there is a vast collection of structured, semi-structured, and unstructured data, so Kprototype needs to be optimized so that these varieties of data can be analyzed efficiently. In this work, the Kprototype algorithm is implemented on MapReduce. Experiments show that Kprototype implemented on MapReduce gives a better performance gain on multiple nodes than on a single node. CPU execution time and speedup are used as evaluation metrics for comparison. An intelligent splitter is proposed that splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
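
To make the idea of a combined numeric and categorical distance inside map and reduce phases more concrete, here is a small single-process sketch. It is not the implementation from this paper: the function names, the toy `records`, and the weight `GAMMA` are hypothetical, and the map, shuffle, and reduce steps are simulated in ordinary Python with no Hadoop job involved. The dissimilarity used is the usual k-prototypes form, squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches.

```python
from collections import Counter, defaultdict

GAMMA = 1.0  # assumed weight of the categorical part; tune per data set


def dissimilarity(record, prototype, gamma=GAMMA):
    """Squared Euclidean distance on numerics plus gamma * categorical mismatches."""
    nums, cats = record
    p_nums, p_cats = prototype
    numeric = sum((x - y) ** 2 for x, y in zip(nums, p_nums))
    mismatches = sum(1 for a, b in zip(cats, p_cats) if a != b)
    return numeric + gamma * mismatches


def map_phase(records, prototypes):
    """Map: emit (nearest prototype id, record) for every input record."""
    for record in records:
        nearest = min(range(len(prototypes)),
                      key=lambda i: dissimilarity(record, prototypes[i]))
        yield nearest, record


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(cluster_id, members):
    """Reduce: new prototype = numeric means plus categorical modes."""
    nums = tuple(sum(col) / len(members) for col in zip(*(m[0] for m in members)))
    cats = tuple(Counter(col).most_common(1)[0][0]
                 for col in zip(*(m[1] for m in members)))
    return cluster_id, (nums, cats)


def iterate(records, prototypes):
    """One full map -> shuffle -> reduce pass; returns the updated prototypes."""
    groups = shuffle(map_phase(records, prototypes))
    updated = list(prototypes)  # prototypes with no members stay unchanged
    for cluster_id, members in groups.items():
        _, updated[cluster_id] = reduce_phase(cluster_id, members)
    return updated


# Hypothetical mixed records: (numeric attributes, categorical attributes).
records = [
    ((1.0, 2.0), ("red",)),
    ((1.2, 1.8), ("red",)),
    ((8.0, 9.0), ("blue",)),
    ((8.4, 9.2), ("blue",)),
]
prototypes = [records[0], records[2]]
updated = iterate(records, prototypes)
print(updated)
```

In a real cluster the map calls would run in parallel over input splits and only the grouped records (or partial sums from a combiner) would cross the network. Repeating `iterate` until the prototypes stop changing gives the full algorithm.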

“…The methods of data mining, among others clustering methods, classification methods, etc., are needed to extract or mine the knowledge from large amounts of data. To group the data in accordance with their multiple-characteristic based similarities is known as clustering [1].…”

Section: Introduction (mentioning)

confidence: 99%

A Preference Model on Adaptive Affinity Propagation

Refianti

Mutiara

Juarna

et al.

2018

IJECE

In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a new data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter "preference" p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter "preference" p that results in an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter "preference" p, i.e., it is modeled based on the similarity distribution. Given the SM and p, the Modified Adaptive AP (MAAP) procedure is run. The MAAP procedure means that we omit the adaptive p-scanning algorithm used in the original Adaptive-AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than the original AP for the random non-partition dataset, and (ii) for the random 4-partition dataset and real datasets, the proposed algorithm succeeds in identifying clusters according to the number of the dataset's true labels, with execution times comparable to those of the original AP. In addition, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure.
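
The two AP inputs named in this abstract, the similarity matrix and the preference p, can be illustrated with a short sketch. It does not reproduce the preference model proposed in the paper. It assumes the commonly used conventions of negative squared Euclidean distance as similarity and the median of the off-diagonal similarities as a baseline preference; the function names and toy `points` are hypothetical.

```python
from statistics import median


def similarity_matrix(points):
    """s(i, j) = negative squared Euclidean distance between points i and j."""
    return [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def with_median_preference(sim):
    """Return a copy of sim whose diagonal holds the median off-diagonal similarity.

    The diagonal entry s(k, k) is the preference p: larger values make point k
    more likely to be chosen as an exemplar, so p influences the cluster count.
    """
    n = len(sim)
    off_diagonal = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    preference = median(off_diagonal)
    out = [row[:] for row in sim]
    for i in range(n):
        out[i][i] = preference
    return out, preference


# Hypothetical toy data.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
sim, preference = with_median_preference(similarity_matrix(points))
print(preference)
```

A matrix prepared this way is what the message-passing iterations of AP would then consume. Choosing p from the distribution of similarities, as the median does in the simplest possible way, is the general idea the abstract builds on.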
