Abstract: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows …
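The k-center problem referenced in this abstract admits a classic sequential 2-approximation, Gonzalez's farthest-first traversal, which is the kind of sequential subroutine MapReduce clustering algorithms of this type build on. A minimal single-machine sketch (illustrative only, not the paper's algorithm; function name is hypothetical):

```python
import math

def greedy_k_center(points, k):
    """Farthest-first traversal: a classic 2-approximation for k-center.

    points: list of (x, y) tuples; k: number of centers to pick.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    centers = [points[0]]  # start from an arbitrary point
    # d[j] = distance from points[j] to its nearest chosen center so far
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # pick the point farthest from all current centers
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        # update nearest-center distances against the new center
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers
```

Each chosen center is at distance at most twice the optimal radius from every point, which is what makes this a useful building block inside sampling-based distributed schemes.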
“…Distributed versions of clustering algorithms related to kernel k-Means, like classic k-Means [86] and k-Medians [87] have already been proposed. However, to the best of our knowledge, a distributed approach to kernel k-Means has not been proposed yet.…”
Section: B Distributed Trimmed Kernel K-means Clustering
Abstract-A typical high-end film production generates several terabytes of data per day, either as footage from multiple cameras or as background information regarding the set (laser scans, spherical captures, etc). This paper presents solutions to improve the integration of the multiple data sources, and understand their quality and content, which are useful both to support creative decisions on-set (or near it) and enhance the post-production process. The main cinema specific contributions, tested on a multisource production dataset made publicly available for research purposes, are the monitoring and quality assurance of multi-camera set-ups, multisource registration and acceleration of 3D reconstruction, anthropocentric visual analysis techniques for semantic content annotation, and integrated 2D-3D web visualization tools. We discuss as well improvements carried out in basic techniques for acceleration, clustering and visualization, which were necessary to deal with the very large multisource data, and can be applied to other big data problems in diverse application fields.
“…The MapReduce framework is widely used for processing and managing large data sets on a distributed cluster, and it has been applied to numerous tasks such as document clustering, access-log analysis, generating search indexes, and various other data analytics operations. A substantial body of literature on Big Data clustering with the MapReduce framework has appeared in recent years [3, 4, 13–16]. A modified k-means clustering algorithm based on the MapReduce framework is proposed by Li et al. [17] to perform clustering on large data sets.…”
Section: Background and Literature Review
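MapReduce formulations of k-means such as the one cited above follow a standard decomposition of Lloyd's iteration: mappers assign each point to its nearest centroid, and reducers recompute each centroid as the mean of its assigned points. A toy single-machine sketch of one such round (illustrative; this is not the code of Li et al. [17], and the function names are hypothetical):

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map phase: emit (nearest-centroid-index, point) pairs."""
    pairs = []
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        pairs.append((idx, p))
    return pairs

def kmeans_reduce(pairs):
    """Reduce phase: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # coordinate-wise mean of each group
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups.values()]

# One MapReduce round; a driver would repeat until the centroids stabilize.
```

The iterative nature of this decomposition, one MapReduce job per Lloyd iteration, is exactly the cost that the single-job initialization schemes discussed below try to avoid.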
Document summarization provides an instrument for faster understanding the collection of text documents and has a number of real life applications. Semantic similarity and clustering can be utilized efficiently for generating effective summary of large text collections. Summarizing large volume of text is a challenging and time consuming problem particularly while considering the semantic similarity computation in summarization process. Summarization of text collection involves intensive text processing and computations to generate the summary. MapReduce is proven state of art technology for handling Big Data. In this paper, a novel framework based on MapReduce technology is proposed for summarizing large text collection. The proposed technique is designed using semantic similarity based clustering and topic modeling using Latent Dirichlet Allocation (LDA) for summarizing the large text collection over MapReduce framework. The summarization task is performed in four stages and provides a modular implementation of multiple documents summarization. The presented technique is evaluated in terms of scalability and various text summarization parameters namely, compression ratio, retention ratio, ROUGE and Pyramid score are also measured. The advantages of MapReduce framework are clearly visible from the experiments and it is also demonstrated that MapReduce provides a faster implementation of summarizing large text collections and is a powerful tool in Big Text Data analysis.
“…Papadimitriou et al. presented the distributed co-clustering framework, which introduced practical approaches for distributed data preprocessing and co-clustering [11]. Ene et al. proposed fast clustering using MapReduce [12], adopting a MapReduce sampling technique to decrease the data size. The result of this method was applied to the k-center and k-median algorithms.…”
Section: Related Work
“…As we have discussed above, this algorithm needs too many MapReduce jobs. The research in [12] proposed a fast clustering scheme which uses a sampling technique. That paper also proved that MapReduce-KCenter is a (4α + 2)-approximation for the k-center problem and MapReduce-KMedian is a (10α + 3)-approximation for the k-median problem, where α is the approximation factor of the sequential clustering subroutine used.…”
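The high-level pattern in [12] is to shrink the input to a small representative sample on which a sequential α-approximation algorithm is then run; the quoted factors arise from composing the sampling error with α. A simplified single-machine caricature of this sample-then-cluster pattern (the uniform sampling below is an assumption for illustration; the paper uses a more careful iterative sampling procedure with provable guarantees):

```python
import random

def sample_then_cluster(points, k, sample_size, cluster_fn, seed=0):
    """Cluster a random sample instead of the full data set.

    cluster_fn: any sequential k-clustering routine (the alpha-approximation).
    Uniform sampling here is a stand-in for the iterative sampling of [12].
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return cluster_fn(sample, k)
```

The point of the sketch is only the shape of the computation: the expensive sequential algorithm touches `sample_size` points rather than all of them, which is what makes the scheme fit in a small number of MapReduce rounds.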
Abstract-k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers, and a proper initialization is crucial for obtaining an ideal solution. To overcome this problem, k-means++ was proposed to choose the centers sequentially so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of the data increases. To improve its scalability and efficiency, this paper presents a MapReduce k-means++ method which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain the k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new MapReduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce k-means++ method obtains an O(α²) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted, and the performance results indicate that the proposed MapReduce k-means++ method is much more efficient and can reach a good approximation.
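The k-means++ seeding that the Mapper phase executes is the standard D² sampling procedure: each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch of sequential D² seeding (illustrative only, not the paper's distributed implementation):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: draw each new center with probability
    proportional to squared distance from the nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of every point to its nearest current center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:  # all remaining points coincide with chosen centers
            centers.append(rng.choice(points))
            continue
        # roulette-wheel draw proportional to d2
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

The weighted variant run in the Reducer phase follows the same procedure, except that each candidate's d² term is multiplied by the number of input points it represents.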