Abstract: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows …
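The k-center problem referenced in this abstract admits a classic sequential 2-approximation, Gonzalez's farthest-first traversal, which is the kind of sequential subroutine MapReduce clustering algorithms of this type build on. A minimal single-machine sketch (illustrative only, not the paper's algorithm; function name is hypothetical):

```python
import math

def greedy_k_center(points, k):
    """Farthest-first traversal: a classic 2-approximation for k-center.

    points: list of (x, y) tuples; k: number of centers to pick.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    centers = [points[0]]  # start from an arbitrary point
    # d[j] = distance from points[j] to its nearest chosen center so far
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # pick the point farthest from all current centers
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        # update nearest-center distances against the new center
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers
```

Each chosen center is at distance at most twice the optimal radius from every point, which is what makes this a useful building block inside sampling-based distributed schemes.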
“…Distributed versions of clustering algorithms related to kernel k-Means, like classic k-Means [86] and k-Medians [87] have already been proposed. However, to the best of our knowledge, a distributed approach to kernel k-Means has not been proposed yet.…”
Section: B Distributed Trimmed Kernel K-means Clustering
Abstract-A typical high-end film production generates several terabytes of data per day, either as footage from multiple cameras or as background information regarding the set (laser scans, spherical captures, etc). This paper presents solutions to improve the integration of the multiple data sources, and understand their quality and content, which are useful both to support creative decisions on-set (or near it) and enhance the post-production process. The main cinema specific contributions, tested on a multisource production dataset made publicly available for research purposes, are the monitoring and quality assurance of multi-camera set-ups, multisource registration and acceleration of 3D reconstruction, anthropocentric visual analysis techniques for semantic content annotation, and integrated 2D-3D web visualization tools. We discuss as well improvements carried out in basic techniques for acceleration, clustering and visualization, which were necessary to deal with the very large multisource data, and can be applied to other big data problems in diverse application fields.
“…The MapReduce framework is widely used for processing and managing large data sets on a distributed cluster, and it has been applied to numerous tasks such as document clustering, access-log analysis, generating search indexes, and various other data analytics operations. A substantial body of literature on Big Data clustering with the MapReduce framework has appeared in recent years [3, 4, 13–16]. A modified k-means clustering algorithm based on the MapReduce framework is proposed by Li et al. [17] to perform clustering on large data sets.…”
Section: Background and Literature Review
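MapReduce formulations of k-means such as the one cited above follow a standard decomposition of Lloyd's iteration: mappers assign each point to its nearest centroid, and reducers recompute each centroid as the mean of its assigned points. A toy single-machine sketch of one such round (illustrative; this is not the code of Li et al. [17], and the function names are hypothetical):

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map phase: emit (nearest-centroid-index, point) pairs."""
    pairs = []
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        pairs.append((idx, p))
    return pairs

def kmeans_reduce(pairs):
    """Reduce phase: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # coordinate-wise mean of each group
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups.values()]

# One MapReduce round; a driver would repeat until the centroids stabilize.
```

The iterative nature of this decomposition, one MapReduce job per Lloyd iteration, is exactly the cost that the single-job initialization schemes discussed below try to avoid.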
Document summarization provides an instrument for faster understanding the collection of text documents and has a number of real life applications. Semantic similarity and clustering can be utilized efficiently for generating effective summary of large text collections. Summarizing large volume of text is a challenging and time consuming problem particularly while considering the semantic similarity computation in summarization process. Summarization of text collection involves intensive text processing and computations to generate the summary. MapReduce is proven state of art technology for handling Big Data. In this paper, a novel framework based on MapReduce technology is proposed for summarizing large text collection. The proposed technique is designed using semantic similarity based clustering and topic modeling using Latent Dirichlet Allocation (LDA) for summarizing the large text collection over MapReduce framework. The summarization task is performed in four stages and provides a modular implementation of multiple documents summarization. The presented technique is evaluated in terms of scalability and various text summarization parameters namely, compression ratio, retention ratio, ROUGE and Pyramid score are also measured. The advantages of MapReduce framework are clearly visible from the experiments and it is also demonstrated that MapReduce provides a faster implementation of summarizing large text collections and is a powerful tool in Big Text Data analysis.
“…Papadimitriou et al. presented the distributed co-clustering framework, which introduced practical approaches for distributed data preprocessing and co-clustering [11]. Ene et al. proposed fast clustering using MapReduce [12], adopting a MapReduce sampling technique to decrease the data size. The result of this method was applied to the k-center and k-median algorithms.…”
Section: Related Work
“…As we have discussed above, this algorithm needs too many MapReduce jobs. The research in [12] proposed a fast clustering scheme which uses a sampling technique. That paper also proved that MapReduce-KCenter is a (4α + 2)-approximation for the k-center problem and MapReduce-KMedian is a (10α + 3)-approximation for the k-median problem, where α is the approximation factor of the sequential clustering subroutine used.…”
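The high-level pattern in [12] is to shrink the input to a small representative sample on which a sequential α-approximation algorithm is then run; the quoted factors arise from composing the sampling error with α. A simplified single-machine caricature of this sample-then-cluster pattern (the uniform sampling below is an assumption for illustration; the paper uses a more careful iterative sampling procedure with provable guarantees):

```python
import random

def sample_then_cluster(points, k, sample_size, cluster_fn, seed=0):
    """Cluster a random sample instead of the full data set.

    cluster_fn: any sequential k-clustering routine (the alpha-approximation).
    Uniform sampling here is a stand-in for the iterative sampling of [12].
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return cluster_fn(sample, k)
```

The point of the sketch is only the shape of the computation: the expensive sequential algorithm touches `sample_size` points rather than all of them, which is what makes the scheme fit in a small number of MapReduce rounds.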
Abstract-k-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers, and a proper initialization is crucial for obtaining an ideal solution. To overcome this problem, k-means++ was proposed to choose the centers sequentially so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, k-means++ becomes inefficient as the size of the data increases. To improve its scalability and efficiency, this paper presents a MapReduce k-means++ method which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain the k centers. The k-means++ initialization algorithm is executed in the Mapper phase and the weighted k-means++ initialization algorithm is run in the Reducer phase. As this new MapReduce k-means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce k-means++ method obtains an O(α²) approximation to the optimal solution of k-means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted, and the performance results indicate that the proposed MapReduce k-means++ method is much more efficient and can reach a good approximation.
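The k-means++ seeding that the Mapper phase executes is the standard D² sampling procedure: each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch of sequential D² seeding (illustrative only, not the paper's distributed implementation):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: draw each new center with probability
    proportional to squared distance from the nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of every point to its nearest current center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:  # all remaining points coincide with chosen centers
            centers.append(rng.choice(points))
            continue
        # roulette-wheel draw proportional to d2
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

The weighted variant run in the Reducer phase follows the same procedure, except that each candidate's d² term is multiplied by the number of input points it represents.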