We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH [27] with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high-quality solutions quickly in practice. First, BICO computes a summary S of the data with a provable quality guarantee: for every center set C, S has the same cost as P up to a (1 + ε)-factor, i.e., S is a coreset. Then, it runs k-means++ [5] on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (Stream-KM++ [2], StreamLS [16,26]) with the best known quality guarantees. We achieve the same quality as these approximation algorithms with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.
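The two ingredients named in this abstract, the k-means cost and k-means++ seeding on a weighted summary, can be illustrated with a minimal Python sketch. This is not BICO itself: the function names (`kmeans_cost`, `kmeanspp_seed`), the optional `weights` argument, and the toy points are assumptions made for illustration only.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, centers, weights=None):
    """Weighted sum of squared distances from each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def kmeanspp_seed(points, k, weights=None, rng=None):
    """k-means++ style seeding: sample each new center with probability
    proportional to weight * squared distance to the nearest chosen center."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    if weights is None:
        weights = [1.0] * len(points)
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        d2 = [w * min(sq_dist(p, c) for c in centers)
              for p, w in zip(points, weights)]
        if sum(d2) == 0:  # every point already coincides with a center
            break
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Toy input (hypothetical): two well-separated pairs of points.
P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]
print(kmeans_cost(P, C))
print(kmeanspp_seed(P, 2))
```

Passing a weighted summary S in place of P, together with its weights, is how a seeding routine of this kind would be run on a coreset rather than on the full stream.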
One of the most fundamental questions in graph property testing is to characterize the combinatorial structure of properties that are testable with a constant number of queries. We work towards an answer to this question for the bounded-degree graph model introduced in [GR02], where the input graphs have maximum degree bounded by a constant d. In this model, it is known (among other results) that every hyperfinite property is constant-query testable [NS13], where, informally, a graph property is hyperfinite if for every δ > 0 every graph in the property can be partitioned into small connected components by removing δn edges. In this paper we show that hyperfiniteness plays a role in every testable property, i.e., we show that every testable property is either finite (which trivially implies hyperfiniteness and testability) or contains an infinite hyperfinite subproperty. A simple consequence of our result is that no infinite graph property that only consists of expander graphs is constant-query testable. Based on the above findings, one could ask if every infinite testable non-hyperfinite property might contain an infinite family of expander (or near-expander) graphs. We show that this is not true. Motivated by our counter-example, we develop a theorem that shows that we can partition the set of vertices of every bounded-degree graph into a constant number of subsets and a separator set, such that the separator set is small and the distribution of k-discs on every subset of a partition class is roughly the same as that of the partition class if the subset has small expansion.
We study fine-grained error bounds for differentially private algorithms for averaging and counting in the continual observation model. For this, we use the completely bounded spectral norm (cb norm) from operator algebra. For a matrix W, its cb norm is defined as $\|W\|_{cb} = \max_{Q} \frac{\|Q \bullet W\|}{\|Q\|}$, where $Q \bullet W$ denotes the Schur product and $\|\cdot\|$ denotes the spectral norm. We bound the cb norm of two fundamental matrices studied in differential privacy under the continual observation model: the counting matrix $M_{counting}$ and the averaging matrix $M_{average}$. For $M_{counting}$, we give lower and upper bounds whose additive gap is $1 + \frac{1}{\pi}$. Our factorization also has two desirable properties sufficient for the streaming setting: the factorization consists of lower-triangular matrices and the number of distinct entries in the factorization is exactly T. This allows us to compute the factorization on the fly while requiring the curator to store a T-dimensional vector. For $M_{average}$, we show an additive gap between the lower and upper bounds of ≈ 0.64.
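To make the two structural properties concrete, the following Python sketch builds one lower-triangular factorization of the T × T counting matrix (the all-ones lower-triangular matrix) and checks it exactly with rational arithmetic. The abstract does not say which factorization is used, so the choice here, the matrix square root with Toeplitz entries f(k) = C(2k, k) / 4^k, is an assumption for illustration, as are all function names.

```python
from fractions import Fraction

def sqrt_coeffs(T):
    """First T coefficients f(k) = C(2k, k) / 4^k of the series (1 - x)^(-1/2),
    computed with the recurrence f(k) = f(k-1) * (2k - 1) / (2k)."""
    f = [Fraction(1)]
    for k in range(1, T):
        f.append(f[-1] * Fraction(2 * k - 1, 2 * k))
    return f

def toeplitz_lower(f):
    """Lower-triangular Toeplitz matrix whose (i, j) entry is f[i - j] for i >= j."""
    T = len(f)
    return [[f[i - j] if i >= j else Fraction(0) for j in range(T)]
            for i in range(T)]

def counting_matrix(T):
    """All-ones lower-triangular matrix: row i sums the first i + 1 inputs."""
    return [[Fraction(1) if i >= j else Fraction(0) for j in range(T)]
            for i in range(T)]

def matmul(A, B):
    """Plain matrix product for square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

T = 6
f = sqrt_coeffs(T)          # the only values that need to be stored
L = toeplitz_lower(f)       # lower-triangular factor
product = matmul(L, L)      # L * L should reproduce the counting matrix
print(product == counting_matrix(T))
```

In this sketch the factor is fully described by the T values in `f`, which matches the abstract's remark that a T-dimensional vector suffices, and row i of `L` depends only on `f[0..i]`, so it can be produced as the stream advances.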
In fully dynamic clustering problems, a clustering of a given data set in a metric space must be maintained while it is modified through insertions and deletions of individual points. In this paper, we resolve the complexity of fully dynamic k-center clustering against both adaptive and oblivious adversaries. Against oblivious adversaries, we present the first algorithm for fully dynamic k-center in an arbitrary metric space that maintains an optimal (2 + ε)-approximation in O(k · polylog(n, ∆)) amortized update time. Here, n is an upper bound on the number of active points at any time, and ∆ is the aspect ratio of the metric space. Previously, the best known amortized update time was O(k² · polylog(n, ∆)), due to Chan, Gourqin, and Sozio (2018). Moreover, we demonstrate that our runtime is optimal up to polylog(n, ∆) factors. In fact, we prove that even offline algorithms for k-clustering tasks in arbitrary metric spaces, including k-medians, k-means, and k-center, must make at least Ω(nk) distance queries to achieve any non-trivial approximation factor. This implies a lower bound of Ω(k) which holds even for the insertions-only setting. For adaptive adversaries, we give the first deterministic algorithm for fully dynamic k-center which achieves an O(min{log(n/k) / log log n, k})-approximation in O(k · polylog(n, ∆)) amortized update