We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1 + ε)-factor, for arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε²) right singular vectors (principal components) of A.

A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ∼ n.

Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA, and projective clustering. These algorithms use update time per point and memory that are polynomial in log n and only linear in d.

For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k
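The dimensionality-reduction step the abstract describes can be sketched in a few lines of NumPy: project the rows of A onto its first m right singular vectors (with m playing the role of O(k/ε²)), and keep the squared Frobenius norm of the residual as the additive constant. This is a minimal illustrative sketch, not the paper's tuned construction; the function name and interface are ours.

```python
import numpy as np

def project_rows(A, m):
    """Project the rows of A onto its first m right singular vectors.

    The abstract's claim: for m = O(k / eps^2), the k-means cost of the
    projected rows, plus the returned constant, (1+eps)-approximates the
    cost of the original rows for every candidate set of k centers.
    """
    # Economy SVD: A = U @ diag(S) @ Vt; the rows of Vt are the right
    # singular vectors of A, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V_m = Vt[:m].T                 # d x m matrix, columns = top-m right singular vectors
    A_m = A @ V_m @ V_m.T          # rows of A projected onto span(V_m); rank <= m
    residual = np.linalg.norm(A - A_m, "fro") ** 2   # the additive constant
    return A_m, residual
```

Because the projection is orthogonal, ‖A‖_F² = ‖A_m‖_F² + residual, and the residual can only shrink as m grows.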
We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P, C) = ∑_{p ∈ P} min_{c ∈ C} D(p, c) is minimized. The main result in this article can be stated as follows: there exists a (1 + ε)-approximation algorithm for the k-median problem with respect to D if the 1-median problem can be approximated within a factor of (1 + ε) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm requires time n · 2^{O(mk log(mk/ε))}, where m is a constant that depends only on ε and D. Using this characterization, we obtain the first linear-time (1 + ε)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
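The sampling property the characterization relies on can be illustrated with a short sketch: draw a constant-size sample, solve the 1-median problem on the sample, and return that point as an approximate 1-median of the whole set. This is our own simplified illustration (restricting candidate centers to sample points, which is the usual discrete relaxation); the constant m in the paper depends on ε and the measure D.

```python
import random

def sample_1median(points, dist, m, seed=0):
    """Approximate a 1-median of `points` under dissimilarity `dist` by
    solving the 1-median problem restricted to a random sample of size m.

    Illustrative sketch of the sampling property the abstract describes;
    candidates are restricted to sample points for simplicity.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(m, len(points)))
    # Exact 1-median on the sample, with sample points as candidate centers.
    return min(sample, key=lambda c: sum(dist(p, c) for p in sample))
```

For a measure where the property holds (e.g. points on the line with absolute difference), the sampled center's cost on the full set is close to the optimal 1-median cost.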
We present two space-bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm makes no assumptions on the order of edges in the stream. It uses space inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space inversely related to the ratio between the number of triangles and the number of length-2 paths in the graph, and expected update time O(log |V| · (1 + s · |V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20, 8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size, and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficients, and the discovery of frequent patterns in large graphs.

We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we ran experiments with parameter s = 1,000, 10,000, 100,000, and 1,000,000 to evaluate running time and approximation guarantee. Both algorithms proved time efficient for these sample sizes. The approximation quality of the first algorithm varied significantly; even for s = 1,000,000 we saw more than 10% deviation on more than half of the instances. The second algorithm performed much better: even for s = 10,000 the average deviation was less than 6% (taken over all but the largest instance, for which we could not compute the number of triangles exactly).

(This work was partially supported by the EU within the 6th Framework Programme under contract 001907 "Dynamically Evolving, Large Scale Information Systems" (DELIS). Part of this work was done while one author was a post-doc at Università degli Studi di Roma "La Sapienza", and part while another author was visiting the School of Computer Science at Carnegie Mellon University.)
We show that there is a distribution over linear mappings R : R^n → R^{O(d log d)} such that, with arbitrarily large constant probability, for any fixed d-dimensional subspace L of R^n and all x ∈ L, ||x||_1 ≤ ||Rx||_1 ≤ O(d log d) · ||x||_1. This provides the first analogue of the ubiquitous subspace Johnson-Lindenstrauss embedding for the 1-norm. Importantly, the target dimension and distortion are independent of the ambient dimension n. We give several applications of this result. First, we give a faster algorithm for computing well-conditioned bases. Our algorithm is simple, avoiding the linear programming machinery required by previous algorithms. We also give faster algorithms for least absolute deviation regression and the 1-norm best-fit hyperplane problem, as well as the first single-pass streaming algorithms with low space for these problems. These results are motivated by practical problems in image analysis, spam detection, and statistics, where the 1-norm is used in studies where outliers may be safely and effectively ignored; this is because the 1-norm is more robust to outliers than the 2-norm.
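Sketching matrices with i.i.d. Cauchy entries are the standard tool for ℓ1 subspace embeddings of this kind, since the Cauchy distribution is 1-stable. The sketch below is illustrative only: the scaling and target dimension are not the tuned parameters from the result above, and the function names are ours.

```python
import numpy as np

def cauchy_sketch(n, r, seed=0):
    """Random r x n sketching matrix with i.i.d. standard Cauchy entries,
    scaled by 1/r.

    By 1-stability, each coordinate of R @ x is distributed as a Cauchy
    variable scaled by ||x||_1 / r, which is what makes such matrices
    suitable for preserving 1-norms over a low-dimensional subspace.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_cauchy((r, n)) / r

def sketch_l1_norm(R, x):
    """1-norm of the sketched vector R @ x."""
    return np.abs(R @ x).sum()
```

The sketch is linear, so it commutes with scaling: sketching 2x yields exactly twice the sketched norm of x.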
A k-disc around a vertex v of a graph G = (V, E) is the subgraph induced by all vertices of distance at most k from v. We show that the structure of a planar graph on n vertices, with constant maximum degree d, is determined, up to the modification (insertion or deletion) of at most εdn edges, by the frequency of k-discs for a certain k = k(ε, d) that is independent of the size of the graph. We can replace planar graphs by any hyperfinite class of graphs, which includes, for example, every graph class that does not contain a set of forbidden minors.

A purely combinatorial consequence of this result is that two d-bounded-degree graphs with similar frequency vectors (that is, the ℓ1 difference between the frequency vectors is small) are close to being isomorphic (close here means that by inserting or deleting not too many edges in one of them, it becomes isomorphic to the other).

We also obtain the following new results in the area of property testing, which are essentially equivalent to the above statement. We prove that
• graph isomorphism is testable for every class of hyperfinite graphs,
• every graph property is testable for every class of hyperfinite graphs,
• every hyperfinite graph property is testable in the bounded-degree graph model,
• a large class of graph parameters is approximable for hyperfinite graphs.
Our results also give a partial explanation of the success of motifs in the analysis of complex networks.
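The central object, the k-disc frequency vector, can be computed with a breadth-first search from every vertex. The sketch below is ours and uses a simplified canonical encoding (sorted pairs of distance from the root and degree within the disc) as a proxy for the exact rooted-isomorphism type used in the statement: it distinguishes many, but not all, non-isomorphic discs.

```python
from collections import Counter, deque

def k_disc_frequencies(adj, k):
    """Frequency vector of (proxy) k-disc types of a graph.

    adj: dict mapping each vertex to a set of its neighbours.
    Each k-disc is the subgraph induced by vertices within distance k of
    the root; it is encoded by the sorted multiset of
    (distance-from-root, degree-inside-disc) pairs.
    """
    def disc_key(root):
        dist = {root: 0}
        q = deque([root])
        while q:                         # BFS out to distance k
            u = q.popleft()
            if dist[u] == k:
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        disc = set(dist)
        sigs = sorted((d, sum(1 for w in adj[v] if w in disc))
                      for v, d in dist.items())
        return tuple(sigs)

    return Counter(disc_key(v) for v in adj)
```

On a vertex-transitive graph such as a cycle, every vertex has the same k-disc type, so the frequency vector has a single entry; a path has distinct endpoint and interior disc types.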
A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1, . . . , ∆}^d [26]. We develop streaming (1 + ε)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), maximum spanning tree (MaxST), and average distance over dynamic geometric data streams. Our algorithms maintain a small weighted set of points (a coreset) that, with probability 2/3, approximates the current point set with respect to the considered problem during the m insert/delete operations of the data stream. They use poly(ε^{-1}, log m, log ∆) space and update time per insert/delete operation, for constant k and dimension d.

Given a coreset, one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential algorithm is sometimes feasible, as its running time may still be polynomial in n. For example, one can compute in poly(log n, exp(O((1 + log(1/ε)/ε)^{d−1}))) time a solution to k-median and k-means [21], where n is the size of the current point set and k and d are constants. Finding an implicit solution to MaxCut can be done in poly(log n, exp((1/ε)^{O(1)})) time. For MaxST and average distance we require poly(log n, ε^{-1}) time, and for MaxWM we require O(n^3) time to do this.
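The merge-and-reduce technique mentioned in the abstracts above has a simple generic skeleton: buffer the stream in blocks, reduce each block to a small summary (a coreset), and repeatedly merge summaries of equal rank, reducing after each merge, exactly like carries in a binary counter. The sketch below is ours and takes the reduction as a pluggable function; with a coreset construction as `reduce_fn`, the hierarchy has O(log n) levels, which is why the error and summary size grow only polylogarithmically.

```python
def merge_and_reduce(stream, reduce_fn, block_size):
    """Generic merge-and-reduce over a stream.

    reduce_fn: maps a list of points to a small summary list (a coreset).
    Summaries are kept per rank; merging two rank-r summaries produces one
    rank-(r+1) summary, so at most one summary exists per rank at any time.
    """
    levels = {}                       # rank -> summary awaiting a partner

    def push(summary, rank):
        while rank in levels:         # carry: merge equal-rank summaries
            partner = levels.pop(rank)
            summary = reduce_fn(partner + summary)
            rank += 1
        levels[rank] = summary

    buffer = []
    for p in stream:
        buffer.append(p)
        if len(buffer) == block_size:
            push(reduce_fn(buffer), 0)
            buffer = []
    if buffer:                        # flush the final partial block
        push(reduce_fn(buffer), 0)

    out = []
    for r in sorted(levels):          # union of the surviving summaries
        out += levels[r]
    return out
```

As a toy example, keeping only the minimum and maximum of each block is a valid "coreset" for range queries, and merge-and-reduce preserves the global extremes while storing only O(log n) small summaries.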