Summary
When facing massive statistical data, the k‐means algorithm struggles to meet data‐processing demands because it lacks an effective parallel mechanism. This paper proposes an improved k‐means algorithm (IMR‐KCA) for clustering analysis of medical data using the MapReduce computing framework. By analyzing the extensive redundant computation in traditional k‐means algorithms, we first propose a selection model that simplifies the computations involving multiple clustering centers, and we prove its correctness through several theorems. Second, we provide a method for calculating the distances from extreme points to central points, in which the original Euclidean distance is replaced with the Manhattan distance; a group of theorems is proposed to prove the correctness of this simplification. Next, we provide a group of implementation algorithms that parallelize the clustering computation on the MapReduce framework. Finally, experimental results show that IMR‐KCA is more reliable and efficient than a direct MapReduce parallelization of the traditional clustering algorithms. Copyright © 2017 John Wiley & Sons, Ltd.
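
For orientation, the following is a minimal single-process sketch of one k‐means iteration written in map/reduce style with the Manhattan distance. It is not the IMR‐KCA algorithm: the selection model and the extreme‐point distance method are not specified in this summary and are therefore omitted. The function names (`manhattan`, `map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical, and updating each center as the coordinate‐wise mean is an assumption carried over from standard k‐means.

```python
from collections import defaultdict


def manhattan(p, q):
    """Manhattan (L1) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def map_phase(points, centers):
    """Map step: emit (index of the nearest center, point) for each point."""
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: manhattan(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce step: recompute each center as the mean of its assigned points.

    Assumption: the mean update of standard k-means is kept; a center that
    receives no points keeps its previous position.
    """
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)

    new_centers = list(centers)
    for idx, members in groups.items():
        dim = len(members[0])
        new_centers[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centers


def kmeans_iteration(points, centers):
    """One full iteration: map every point, then reduce per center."""
    return reduce_phase(map_phase(points, centers), centers)


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (12, 10)]
    centers = [(1, 1), (9, 9)]
    print(kmeans_iteration(points, centers))
```

In an actual MapReduce deployment the map step would run on separate data splits and the framework would group the emitted pairs by center index before the reduce step; here that grouping is simulated with a dictionary.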