MapReduce Design of K-Means Clustering Algorithm

Anchalia, Prajesh P.; Koundinya, Anjan K; Srinath, N. K.

doi:10.1109/icisa.2013.6579448

Cited by 43 publications

(16 citation statements)

References 6 publications


“…MapReduce is a framework which is illustrated in Fig. 7 initially represented by Google and Hadoop is an open source version of it [36]. In this section algorithms which are implemented based on this framework are reviewed and their improvements are discussed in terms of three features: …”

Section: MapReduce (mentioning)

confidence: 99%

Big Data Clustering: A Review

Shirkhorshidi, Aghabozorgi, Wah, et al. 2014

Lecture Notes in Computer Science

171


Abstract. Clustering is an essential data mining tool for analyzing big data. Applying clustering techniques to big data is difficult due to the new challenges that big data raises. Because big data refers to terabytes and petabytes of data and clustering algorithms come with high computational costs, the question is how to cope with this problem and how to deploy clustering techniques on big data and get results in a reasonable time. This study aims to review the trend and progress of clustering algorithms in coping with big data challenges, from the very first proposed algorithms to today's novel solutions. The algorithms and the targeted challenges for producing improved clustering algorithms are introduced and analyzed, and afterward the possible future path for more advanced algorithms is illuminated based on today's available technologies and frameworks.


“…Parallel K-Means based on the MapReduce was proposed in [9,16] including three functions that are (1) a map function which takes care of calculating distance from each data sample to clusters and assigns this data sample to the closest cluster, (2) a combine function which calculates local centers before sending them to the reducing function, and, (3) a reduce function which obtains local centers and calculates global centers of each cluster.…”

Section: Parallel K-means Based on MapReduce (mentioning)

confidence: 99%

Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method

Hieu, Meesad 2015

Advances in Intelligent Systems and Computing


Abstract. Clustering very large datasets is a challenging problem for data mining and processing. MapReduce is considered a powerful programming framework that significantly reduces execution time by dividing a job into several tasks and executing them in a distributed environment. K-Means is one of the most used clustering methods, and K-Means based on MapReduce is considered an advanced solution for very large dataset clustering. However, execution time is still an obstacle because the number of iterations increases with dataset size and the number of clusters. This paper presents a new approach for reducing the number of iterations of the K-Means algorithm that can be applied to very large dataset clustering. This new method can reduce up to 30 percent of iterations while maintaining up to 98 percent accuracy when tested with several very large datasets with real data type attributes. Based on the significant results from the experiments, this paper proposes a new fast K-Means clustering method for very large datasets based on MapReduce combined with a new cutting method (abbreviated to FMR.K-Means).
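The three functions named in the citation statement for this entry (map, combine, reduce) can be illustrated with a minimal single-process sketch. This is not the code of any of the cited papers and does not use Hadoop. The function names (`kmeans_map`, `kmeans_combine`, `kmeans_reduce`, `kmeans_iteration`) and the representation of input splits as Python lists are assumptions made for illustration. One further assumption: the combiner emits per-cluster partial sums and counts rather than finished local centers, so that the reducer can recombine them into exact global means.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centers):
    """Map: assign a point to its nearest center, emit (cluster id, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (tuple(point), 1)


def kmeans_combine(pairs):
    """Combine: within one input split, add up vectors and counts per cluster."""
    partial = {}
    for cid, (vec, n) in pairs:
        if cid in partial:
            acc, count = partial[cid]
            partial[cid] = (tuple(a + b for a, b in zip(acc, vec)), count + n)
        else:
            partial[cid] = (vec, n)
    return list(partial.items())


def kmeans_reduce(partials):
    """Reduce: merge the partial sums of one cluster into its new global center."""
    total, count = None, 0
    for vec, n in partials:
        total = vec if total is None else tuple(a + b for a, b in zip(total, vec))
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(splits, centers):
    """Run one map/combine/reduce round over all splits (hypothetical driver)."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for split in splits:
        mapped = [kmeans_map(p, centers) for p in split]
        for cid, partial in kmeans_combine(mapped):
            grouped[cid].append(partial)
    new_centers = list(centers)  # clusters that received no points keep their center
    for cid, partials in grouped.items():
        new_centers[cid] = kmeans_reduce(partials)
    return new_centers


if __name__ == "__main__":
    splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0), (0.0, 1.0)]]
    print(kmeans_iteration(splits, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real deployment the driver would repeat such rounds until the centers stop moving (or an iteration limit is reached), with the current centers distributed to every mapper before each round.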


“…Some other researchers designed and implemented many parallel data mining system in cloud computing; as the most related work, Anchalia et al. designed k‐means clustering using MapReduce.…”

Section: Related Work (mentioning)

confidence: 99%

A parallel k‐means clustering algorithm based on redundance elimination and extreme points optimization employing MapReduce

Liu, Xiao, Yang, et al. 2017

Concurrency and Computation


Summary. When facing massive statistical data, the k‐means algorithm struggles to meet data processing needs because it lacks an effective parallel mechanism. This paper proposes an improved k‐means algorithm (IMR‐KCA) to conduct clustering analysis based on medical data employing the MapReduce computing framework. By analyzing the defects of vast redundancy in the traditional k‐means algorithms, a selection model is first proposed to simplify the computations with multiple clustering centers. Based on several proposed theorems, we prove the correctness of this selection model. Second, this paper provides a method to calculate the distances from extreme points to central points, and the original Euclidean distance is replaced with Manhattan distance. For this simplification, a group of theorems is proposed to prove the correctness. Next, we provide a group of implementation algorithms to complete the parallelism of the clustering computation employing the MapReduce framework. Finally, the experimental results illustrate that IMR‐KCA is more reliable and efficient than the direct parallelization of the traditional clustering algorithms based on MapReduce. Copyright © 2017 John Wiley & Sons, Ltd.
