We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε²) right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ∼ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA, and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k
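The k-means consequence of this claim can be illustrated numerically: cluster the rows of A after projecting them onto the top m right singular vectors, then add back the squared mass outside that subspace as the constant. Below is a minimal sketch, assuming synthetic data and an illustrative choice m = 50 as a stand-in for O(k/ε²); the variable names are our own, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k = 500, 200, 5
# Toy data: k well-separated Gaussian clusters in d dimensions.
A = rng.normal(size=(n, d)) + np.repeat(rng.normal(scale=5, size=(k, d)), n // k, axis=0)

m = 50  # illustrative stand-in for O(k/eps^2)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_m = A @ Vt[:m].T          # rows projected onto the top-m right singular vectors
const = np.sum(s[m:] ** 2)  # squared mass outside the top-m subspace ("the constant")

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A_m)
cost_sketch = km.inertia_ + const  # projected k-means cost plus the constant

# Evaluate the same partition on the original, unprojected rows.
centers_full = np.vstack([A[km.labels_ == i].mean(axis=0) for i in range(k)])
cost_full = sum(np.sum((A[km.labels_ == i] - centers_full[i]) ** 2) for i in range(k))
print(cost_sketch, cost_full)  # close, up to a (1 + eps)-factor per the theorem
```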
We study fair clustering problems as proposed by Chierichetti et al. [CKLV17]. Here, points have a sensitive attribute, and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable [IMMM14] and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.
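To make the balance requirement concrete, here is a minimal sketch of one Lloyd-style iteration whose assignment step enforces balance for two equally sized sensitive groups by pairing points across groups (a fairlet-style heuristic in the spirit of [CKLV17]). The pairing rule and function names are our own illustration, not the paper's exact algorithm.

```python
import numpy as np

def fair_lloyd_step(X, groups, centers):
    """One Lloyd-style iteration with a balance-enforcing assignment step.

    Assumes exactly two sensitive groups (labels 0 and 1) of equal size.
    Each point of group 0 is greedily paired with its nearest unmatched
    point of group 1, and each pair is assigned jointly to the center
    minimizing the pair's total cost, so every cluster stays balanced.
    """
    idx0 = np.flatnonzero(groups == 0)
    free = list(np.flatnonzero(groups == 1))
    pairs = []
    for i in idx0:  # greedy nearest-neighbor pairing across groups
        j = min(free, key=lambda j: np.sum((X[i] - X[j]) ** 2))
        free.remove(j)
        pairs.append((i, j))
    labels = np.empty(len(X), dtype=int)
    for i, j in pairs:  # joint assignment: smallest summed pair cost
        costs = (np.sum((centers - X[i]) ** 2, axis=1)
                 + np.sum((centers - X[j]) ** 2, axis=1))
        labels[i] = labels[j] = np.argmin(costs)
    # Standard Lloyd update step (empty clusters keep their old center).
    new_centers = np.vstack([
        X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
        for c in range(len(centers))
    ])
    return labels, new_centers
```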
The k-means problem consists of finding k centers in ℝ^d that minimize the sum of the squared distances of all points in an input set P from ℝ^d to their closest respective center. Awasthi et al. recently showed that there exists a constant ε′ > 0 such that it is NP-hard to approximate the k-means objective within a factor of 1 + ε′. We establish that this factor 1 + ε′ is at least 1.0013. For a given set of points P ⊂ ℝ^d, the k-means problem consists of finding a partition of P into k clusters (C_1, ..., C_k) with corresponding centers (c_1, ..., c_k) that minimize the sum of the squared distances of all points in P to their corresponding center, i.e. the quantity

∑_{i=1}^{k} ∑_{p ∈ C_i} ‖p − c_i‖²,

where ‖·‖ denotes the Euclidean norm. The k-means problem has been well known since the fifties, when Lloyd [Llo57] developed the famous local search heuristic also known as the k-means algorithm. Various exact, approximate, and heuristic algorithms have been developed since then. For a constant number of clusters k and a constant dimension d, the problem can be solved by enumerating weighted Voronoi diagrams [IKI94]. If the dimension is arbitrary but the number of centers is constant, many polynomial-time approximation schemes are known. For example, [FL11] gives an algorithm with running time O(nd + 2^{poly(1/ε, k)}). In the general case, only constant-factor approximation algorithms are known [JV01, KMN+04], but no algorithm with an approximation ratio smaller than 9 has yet been found. Surprisingly, no hardness results for the k-means problem were known even as recently as ten years ago. Today, it is known that the k-means problem is NP-hard, even for constant k and arbitrary dimension d [ADHP09, Das08] and also for arbitrary k and
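The objective above translates directly into code. Below is a short sketch that evaluates the k-means cost for a given partition and center set, with a tiny worked example; the function name is ours, for illustration only.

```python
import numpy as np

def kmeans_cost(P, labels, centers):
    """Sum over clusters C_i of sum over p in C_i of ||p - c_i||^2."""
    diffs = P - centers[labels]       # each point minus its assigned center
    return float(np.sum(diffs ** 2))  # squared Euclidean distances, summed

# Worked example: two clusters on the real line (d = 1).
P = np.array([[0.0], [1.0], [9.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5], [10.0]])
print(kmeans_cost(P, labels, centers))  # (0.25 + 0.25) + (1 + 1) = 2.5
```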
We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time Award winning algorithm BIRCH [27] with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high-quality solutions quickly in practice. First, BICO computes a summary S of the data with a provable quality guarantee: for every center set C, S has the same cost as P up to a (1 + ε)-factor, i.e., S is a coreset. Then, it runs k-means++ [5] on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (StreamKM++ [2], StreamLS [16, 26]) with the best known quality guarantees. We achieve the same quality as these approximation algorithms with a much shorter running time, and we obtain much better solutions than the heuristics at the cost of only a moderate increase in running time.
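The second phase, solving k-means++ on a weighted summary, can be sketched with off-the-shelf tools: scikit-learn's KMeans uses k-means++ initialization and accepts per-point weights. The summary below is filled with toy values; BICO's actual BIRCH-like tree construction is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Suppose the streaming phase produced a weighted summary S of the input P:
# summary points S[i], each carrying a weight w[i] for the number of input
# points it represents (toy values here, for illustration only).
rng = np.random.default_rng(1)
S = rng.normal(size=(200, 10))
w = rng.integers(1, 50, size=200).astype(float)

# Phase two: run k-means++ on the weighted summary instead of the full stream.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
km.fit(S, sample_weight=w)

# Because S is a coreset, the weighted cost on S approximates the cost on P
# up to a (1 + eps)-factor for every candidate center set C.
print(km.inertia_)
```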