Over half a century old and showing no signs of aging, k-means remains one of the most popular data-processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases. We prove that our proposed initialization algorithm, k-means||, obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
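To make the contrast concrete, here is a minimal NumPy sketch of the two seeding schemes described above: k-means++ draws one center per pass with D²-weighting, while k-means|| oversamples roughly ell points per round (a step that parallelizes over the data) and then reduces the candidate set down to k centers. Function names and the simplified final reduction step are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: k sequential passes; each new center is drawn
    with probability proportional to the squared distance to the
    nearest center chosen so far."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]              # first center: uniform
    d2 = ((X - centers[0]) ** 2).sum(axis=1)         # squared distances
    for _ in range(1, k):
        idx = rng.choice(len(X), p=d2 / d2.sum())    # D^2-weighted draw
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.asarray(centers)

def kmeans_parallel_init(X, k, ell, rounds, rng=None):
    """k-means|| seeding sketch: each round samples ~ell points with
    D^2-weighting (a data-parallel step), so only `rounds` passes are
    needed; the oversampled candidates are then reclustered to k."""
    rng = np.random.default_rng(rng)
    C = X[rng.integers(len(X))][None, :]
    for _ in range(rounds):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        keep = rng.random(len(X)) < np.minimum(1.0, ell * d2 / d2.sum())
        C = np.vstack([C, X[keep]])
    # In the full algorithm each candidate is weighted by the number of
    # points closest to it and weighted k-means++ reduces C to k centers;
    # here we simply recluster the small candidate set, unweighted.
    return kmeans_pp_init(C, k, rng)
```

Note the trade-off the sketch makes visible: k-means++ touches the data k times, whereas k-means|| touches it only `rounds` times, at the cost of a small sequential recluster of the candidate set.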
Greedy algorithms are practitioners' best friends: they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging, since the greedy choice is inherently sequential and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in the parallelization of sequential algorithms. Armed with this primitive, we adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraints. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.
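As a concrete (and deliberately simplified) illustration of the sampling primitive, the sketch below applies a sample-and-prune pattern to the easiest member of the class: modular maximization under a cardinality constraint, where greedy just keeps the k largest values. Each round fits only a memory-sized sample on one machine, updates the greedy solution on that sample, and prunes every remaining item that can no longer enter the solution. The function name and the restriction to the modular case are ours; the paper's general version works with marginal gains under matroid, knapsack, and p-system constraints.

```python
import random

def sample_and_prune_topk(values, k, memory, seed=0):
    """Sample-and-prune sketch for modular maximization under a
    cardinality constraint (i.e. selecting the k largest values).
    Each round: (1) load a random memory-sized sample, (2) run the
    sequential greedy on it, (3) prune items that cannot improve the
    current solution. The ground set shrinks geometrically, so the
    number of rounds is logarithmic with high probability."""
    rng = random.Random(seed)
    ground = list(range(len(values)))        # ids of items still alive
    solution = []                            # ids of the best k seen so far
    while ground:
        rng.shuffle(ground)
        sample, rest = ground[:memory], ground[memory:]
        # sequential greedy on the in-memory sample
        solution = sorted(solution + sample,
                          key=lambda i: values[i], reverse=True)[:k]
        floor = values[solution[-1]] if len(solution) == k else float("-inf")
        # prune: an item that cannot beat the k-th best will never be chosen
        ground = [i for i in rest if values[i] > floor]
    return [values[i] for i in solution]

# e.g. sample_and_prune_topk([5, 1, 9, 7, 3, 8, 2], k=3, memory=2) -> [9, 8, 7]
```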
Clustering problems have numerous applications and become more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems k-center and k-median. We develop fast clustering algorithms with constant-factor approximation guarantees. From a theoretical perspective, we give the first analysis showing that several clustering algorithms are in MRC⁰, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size, and they run a time-consuming clustering algorithm, such as local search or Lloyd's algorithm, on the resulting data set. Our algorithms have sufficient flexibility to be used in practice, since they run in a constant number of MapReduce rounds. We complement these results with experiments comparing the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

Modern data sets are frequently too large to process on a single machine, which renders sequential algorithms unusable. In situations where the amount of data is prohibitively large, the MapReduce [16] programming paradigm is used to overcome this obstacle. MapReduce and its open-source counterpart Hadoop [33] are distributed computing frameworks designed to process massive data sets. The MapReduce model is quite novel, since it interleaves sequential and parallel computation. Succinctly, MapReduce consists of several rounds of computation. There is a set of machines, each of which has a certain amount of memory available. The memory on each machine is limited, and there is no communication between the machines during a round. In each round, the data is distributed among the machines. The data assigned to a single machine is constrained to be sublinear in the input size; this restriction is motivated by the fact that the input size is assumed to be very large [26, 15]. After the data is distributed, each machine performs some computation on the data that is available to it. The output of these computations either is the final result or becomes the input of another MapReduce round. A more precise overview of the MapReduce model is given in Section 1.1.

Problems: In this paper, we are concerned with designing clustering algorithms that can be implemented using MapReduce. In particular, we focus on two well-studied problems: metric k-median and k-center. In both of these problems, we are given a set V of n points, together with the distances between any pair of points; we give a precise description of the input representation below. The goal is to choose k of the points. Each of the k chosen points represents a cluster and is referred…
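The sample-then-solve pattern this abstract describes can be sketched in a few lines. The version below uses uniform sampling and plain Lloyd's iterations purely for illustration; the paper's algorithms use more careful sampling to obtain their constant-factor guarantees, and the distance computations in the final assignment step are what MapReduce parallelizes across machines.

```python
import numpy as np

def sample_then_cluster(X, k, sample_size, iters=20, rng=None):
    """Sketch of sampling-based clustering: shrink the data so an
    expensive sequential solver (here, plain Lloyd's iterations) fits
    on one machine, then assign every point to its nearest center."""
    rng = np.random.default_rng(rng)
    S = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centers = S[rng.choice(len(S), size=k, replace=False)].astype(float)
    for _ in range(iters):                   # Lloyd's on the small sample only
        labels = ((S[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = S[labels == j].mean(axis=0)
    # final step, trivially parallel over machines: nearest-center assignment
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    return centers, assign
```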
Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input factors are defined by a collection of additive inequalities between variables. We refer to these queries as FAQ-AI for short. To answer FAQ-AI in the Boolean semiring, we define relaxed tree decompositions and relaxed submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle's geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering. Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by #subw. This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and FAQ-AI over one semiring can be answered in time proportional to #subw and, respectively, to the relaxed version of #subw. We present three applications of our FAQ-AI framework to relational machine learning: k-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.
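To give a flavor of why additive inequalities admit faster-than-join algorithms, consider the smallest possible instance: counting the tuples of a two-relation join that satisfy a single inequality a + b ≤ t. Sorting one relation and binary-searching stands in for the one-dimensional semigroup range-search structure; the paper's machinery handles the multi-dimensional, multi-relation generalization via Chazelle's data structure inside InsideOut. The function below is a toy sketch of the idea, not the paper's algorithm.

```python
import bisect

def count_pairs_leq(R, S, t):
    """Count pairs (a, b) in R x S with a + b <= t in O(n log n),
    without materializing the (possibly quadratic) join: sort S once,
    then binary-search the matching prefix of S for each a in R."""
    S_sorted = sorted(S)
    return sum(bisect.bisect_right(S_sorted, t - a) for a in R)

# e.g. count_pairs_leq([1, 4], [2, 3, 5], t=6) -> 4
#      (pairs (1,2), (1,3), (1,5), (4,2))
```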