Optimized big data K-means clustering using MapReduce

Cui, Xiaoli; Zhu, Pingfei; Yang, Xin; Li, Keqiu; Ji, Changqing

doi:10.1007/s11227-014-1225-7

Cited by 128 publications

(53 citation statements)

References 14 publications


“…The iteration dependence was eliminated, and high performance was obtained by using the processing model. Extensive experiments demonstrate that the proposed methods were efficient, robust, and scalable.…”

Section: Related Work (mentioning)

confidence: 99%


SC‐OCR: similarity‐based clustering and optimum cache replacement approach

Subramanian

Soundarajan

2016

Concurrency and Computation


Summary: Big data is a term for large-scale, complex datasets. It is expanding rapidly in all science and engineering domains, owing to the fast development of networking, data storage, and data collection capacity. Big data mining is the capability of extracting useful information from these large datasets. Integrating cloud computing with data mining for big data is currently a challenging task. Processing such volumes of data requires improving big data computation itself. Most existing approaches use MapReduce for this computation, and their main drawbacks are increased computational cost and memory consumption. To overcome these limitations, this paper proposes a similarity‐based clustering and optimum cache replacement approach for big data computing applications. The job recovery process is initiated by copying the data to the cloud server and forwarding the copy for further processing. Then, the job is divided into clusters using the similarity‐based clustering approach. Finally, a cache with the optimum cache replacement algorithm is introduced to avoid repeated execution of jobs through queue management. The proposed approach is compared with the existing Spark and Hadoop approaches and achieves better performance in terms of iteration time, query response time, job completion time, and clustering accuracy. Copyright © 2016 John Wiley & Sons, Ltd.



“…MapReduce [1][2][3][4][5][6] is a programming model aimed for parallel processing of large volumes of data by dividing the work into a set of independent tasks. In the MapReduce framework [7][8][9], a distributed file system (DFS) performs initial partitioning of data in multiple machines and represents data as pairs.…”

Section: Introduction (mentioning)

confidence: 99%

SC‐OCR: similarity‐based clustering and optimum cache replacement approach

Subramanian

Soundarajan

2016

Concurrency and Computation
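
The citation statement above describes MapReduce as splitting work into independent tasks over data represented as pairs. As a rough illustration only, the sketch below simulates one K-means iteration in that style in plain Python, with no Hadoop or Spark involved. The function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) and the sample data are hypothetical, and this is not the optimized method of Cui et al.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit a (key, value) pair for one point.

    The key is the index of the nearest centroid; the value is (point, 1)
    so that a reducer can sum coordinates and counts.
    """
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points that were emitted under one key."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """Run one map -> shuffle -> reduce pass and return the updated centroids."""
    # Shuffle: group the mapped values by their key (the centroid index).
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centroids = kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)])
print(new_centroids)
```

Each call to `map_point` depends only on its own point and the current centroids, which is what lets a real framework run the map tasks in parallel. Repeating `kmeans_iteration` until the centroids stop moving gives the usual iterative algorithm.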


“…At present, there are a lot of improvements in the selection of initial clustering centers for K-means algorithm [5][6][7]. (1) The method of maximum and minimum distance is used to select the clustering center point then calculate the K value and find a reasonable center point.…”

Section: Related Work (mentioning)

confidence: 99%

An Optimization Algorithm of Selecting Initial Clustering Center in K-means

Xue

2017

Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)


Abstract: The traditional stand-alone K-means clustering algorithm suffers from long running times and memory overflow when dealing with large-scale data. Although the MapReduce framework solves this problem, clustering accuracy remains unstable because it depends on the selection of the initial clustering centers. Therefore, this paper presents an algorithm that optimizes the initial clustering centers in K-means by taking several equal-scale samples, calculating the local density, and selecting the optimal initial clustering centers. The experimental results show that the optimized algorithm shortens the clustering time and improves the accuracy and stability of the K-means clustering procedure.
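
The citation statement for this entry mentions the maximum and minimum distance method for choosing initial clustering centers. A minimal sketch of the core selection rule follows, assuming a fixed `k` and seeding with the first point; both are simplifications made here for illustration. The step that derives K itself is omitted, and this is not the sampling and local-density algorithm the abstract describes. The function name `max_min_centers` and the sample data are hypothetical.

```python
from math import dist


def max_min_centers(points, k):
    """Pick k initial centers by the max-min distance rule.

    Assumption: the first point seeds the selection. Each later center is
    the point whose distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        # For each point take the minimum distance to the chosen centers,
        # then select the point that maximizes that minimum.
        candidate = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(candidate)
    return centers


# Hypothetical sample data: three loose groups of 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0), (5.0, 8.0)]
centers = max_min_centers(points, 3)
print(centers)
```

Because every new center is far from all existing ones, the initial centers tend to land in different groups, which is the property this family of initialization methods relies on.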


“…Third, from the contents of documents, arbitrary semantic structures can be extracted. As the number of documents without annotations (i.e., unstructured texts) is growing exponentially, it is preferred to take unsupervised methods [5,23,31]. Topic modeling is one of such methods, and it captures the latent semantic structures across the documents.…”

Section: Introduction (mentioning)

confidence: 99%

Discovery of topic flows of authors

et al. 2017


With the increase in the number of Web documents, the number of proposed methods for knowledge discovery on Web documents has increased as well. The documents do not always provide keywords or categories, so unsupervised approaches are desirable, and topic modeling is such an approach for knowledge discovery without using labels. Further, Web documents usually carry time information such as publication years, so knowledge patterns over time can be captured by incorporating the time information. The temporal patterns of knowledge can be used to develop useful services such as a graph of research trends, finding similar authors (potential co-authors) to a particular author, or finding top researchers in a specific research domain. In this paper, we propose a new topic model, the Author Topic-Flow (ATF) model, whose objective is to capture temporal patterns of research interests of authors over time, where each topic is associated with a research domain. … computes the temporal patterns of authors by combining the patterns of topics. We believe that such 'indirect' temporal patterns will be poorer than the 'direct' temporal patterns of our proposed model. The ATF model allows each author to have a separate variable that models the temporal patterns, so we denote it as 'direct' topic flow. The design of the ATF model is based on the hypothesis that 'direct' topic flows will be better than 'indirect' topic flows. We prove the hypothesis is true by a structural comparison between the two models and show the effectiveness of the ATF model by empirical results.
