We present two space-bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm does not make any assumptions on the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and the number of length-2 paths in the graph, and expected update time O(log |V| · (1 + s · |V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20,8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size, and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficients, and the discovery of frequent patterns in large graphs. We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we ran experiments with parameter s = 1,000, 10,000, 100,000, and 1,000,000 to evaluate running time and approximation quality. Both algorithms appear to be time efficient for these sample sizes. The approximation quality of the first algorithm varied significantly, and even for s = 1,000,000 we had more than 10% deviation for more than half of the instances. The second algorithm performed much better, and even for s = 10,000 we had an average deviation of less than 6% (taken over all but the largest instance, for which we could not compute the number of triangles exactly).
* This work was partially supported by the EU within the 6th Framework Programme under contract 001907 "Dynamically Evolving, Large Scale Information Systems" (DELIS).
* Part of this work was done while the author was post-doc at Università degli Studi di Roma "La Sapienza".
† Part of this work was done while the author was visiting the School of Computer Science at Carnegie Mellon University.
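The role of the ratio between triangles and length-2 paths in the abstract above can be illustrated offline. The following is a minimal Python sketch, not the streaming algorithm from the paper: it assumes the whole graph fits in memory as an adjacency map, samples s length-2 paths uniformly at random, and scales the fraction of closed paths by the total number of paths divided by 3 (each triangle closes exactly three such paths). The names `build_adjacency`, `count_length2_paths`, and `estimate_triangles` are hypothetical.

```python
import random


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_length2_paths(adj):
    """Number of length-2 paths: one per unordered neighbour pair of a centre."""
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())


def estimate_triangles(adj, s, rng):
    """Estimate the triangle count from s uniformly sampled length-2 paths.

    Offline illustration only: it stores the full graph, which the
    streaming algorithms described above are designed to avoid.
    """
    centres = [v for v in adj if len(adj[v]) >= 2]
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centres]
    num_paths = sum(weights)
    if num_paths == 0:
        return 0.0
    closed = 0
    # Choosing a centre proportionally to its path count, then a uniform
    # neighbour pair, yields a uniformly random length-2 path.
    for centre in rng.choices(centres, weights=weights, k=s):
        a, b = rng.sample(sorted(adj[centre]), 2)
        if b in adj[a]:
            closed += 1
    # Every triangle contains three closed length-2 paths.
    return closed / s * num_paths / 3
```

When triangles are rare relative to length-2 paths, few sampled paths are closed and more samples are needed for the same relative error, which is consistent with the space bound stated for the second algorithm.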
A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1, …, ∆}^d [26]. We develop streaming (1 + ε)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), maximum spanning tree (MaxST), and average distance over dynamic geometric data streams. Our algorithms maintain a small weighted set of points (a coreset) that approximates with probability 2/3 the current point set with respect to the considered problem during the m insert/delete operations of the data stream. They use poly(ε^{-1}, log m, log ∆) space and update time per insert/delete operation for constant k and dimension d. Having a coreset, one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential algorithm is sometimes feasible as its running time may still be polynomial in n. For example, one can compute in poly(log n, exp(O((1 + log(1/ε)/ε)^{d−1}))) time a solution to k-median and k-means [21], where n is the size of the current point set and k and d are constants. Finding an implicit solution to MaxCut can be done in poly(log n, exp((1/ε)^{O(1)})) time. For MaxST and average distance we require poly(log n, ε^{-1}) time, and for MaxWM we require O(n^3) time to do this.
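The idea of a weighted point set that stands in for the current point set under insertions and deletions can be sketched with a simple grid. This is an assumption-laden illustration rather than the construction from the paper: it keeps an exact counter per occupied grid cell (so it does not achieve the polylogarithmic space bound), and the class name `GridCoreset` and function `kmedian_cost` are hypothetical.

```python
import math
from collections import Counter


class GridCoreset:
    """Weighted grid summary of a point multiset under insert/delete.

    Each occupied cell is represented by its centre, weighted by the number
    of live points in that cell. Illustration only: exact per-cell counters
    are stored, unlike the small-space structure described above.
    """

    def __init__(self, cell_width):
        self.cell_width = cell_width
        self.counts = Counter()

    def _cell(self, point):
        return tuple(int(c // self.cell_width) for c in point)

    def insert(self, point):
        self.counts[self._cell(point)] += 1

    def delete(self, point):
        # Assumes the point was inserted earlier and is still present.
        cell = self._cell(point)
        self.counts[cell] -= 1
        if self.counts[cell] == 0:
            del self.counts[cell]

    def weighted_points(self):
        w = self.cell_width
        return [(tuple((i + 0.5) * w for i in cell), n)
                for cell, n in self.counts.items()]


def kmedian_cost(weighted_points, centers):
    """Weighted sum of distances from each point to its nearest centre."""
    return sum(weight * min(math.dist(p, c) for c in centers)
               for p, weight in weighted_points)
```

Moving a point to its cell centre changes its distance to any centre by at most half the cell diagonal, so for n live points in dimension d the k-median cost on the summary differs from the true cost by at most n · cell_width · √d / 2.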
In this paper we develop an efficient implementation for a k-means clustering algorithm. The novel feature of our algorithm is that it uses coresets to speed up the algorithm. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. The main strength of the algorithm is that it can quickly determine clusterings of the same point set for many values of k. This is necessary in many applications, since, typically, one does not know a good value for k in advance. Once we have clusterings for many different values of k we can determine a good choice of k using a quality measure of clusterings that is independent of k, for example the average silhouette coefficient. The average silhouette coefficient can be approximated using coresets. To evaluate the performance of our algorithm we compare it with algorithm KMHybrid [28] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. We show that our algorithm significantly outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm deviates less than that of KMHybrid. We also computed clusterings and approximate average silhouette coefficient for k = 1, …, 100 for our input instances and discuss the performance of our algorithm in detail.
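Choosing k by the average silhouette coefficient can be sketched as follows. This is a plain, quadratic-time Python illustration on unweighted points, assuming the clusterings for the candidate values of k have already been computed; it is not the coreset-based approximation described in the abstract. The function names `average_silhouette` and `best_k` are hypothetical, and singleton clusters are given a silhouette of 0 by convention.

```python
import math


def average_silhouette(points, labels):
    """Average silhouette coefficient of a clustering (exact, O(n^2))."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: contributes 0 by convention
        a = mean_dist(i, own)  # mean distance within the own cluster
        b = min(mean_dist(i, members)  # mean distance to the nearest other cluster
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, clusterings):
    """Pick the k whose clustering has the highest average silhouette.

    `clusterings` maps each candidate k to a list of cluster labels.
    """
    return max(clusterings,
               key=lambda k: average_silhouette(points, clusterings[k]))
```

Because the score lies in [-1, 1] regardless of k, it can be compared across clusterings with different numbers of clusters, which is the property the abstract relies on.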
A dynamic geometric data stream is a sequence of m ADD/REMOVE operations of points from a discrete geometric space {1, …, Δ}^d. ADD(p) inserts a point p from {1, …, Δ}^d into the current point set P, REMOVE(p) deletes p from P. We develop low-storage data structures to (i) maintain ε-nets and ε-approximations of range spaces of P with small VC-dimension and (ii) maintain a (1 + ε)-approximation of the weight of the Euclidean minimum spanning tree of P. Our data structure for ε-nets uses [Formula: see text] bits of memory and returns with probability 1 − δ a set of [Formula: see text] points that is an ε-net for an arbitrary fixed finite range space with VC-dimension [Formula: see text]. Our data structure for ε-approximations uses [Formula: see text] bits of memory and returns with probability 1 − δ a set of [Formula: see text] points that is an ε-approximation for an arbitrary fixed finite range space with VC-dimension [Formula: see text]. The data structure for the approximation of the weight of a Euclidean minimum spanning tree uses O(log(1/δ) (log Δ/ε)^{O(d)}) space and is correct with probability at least 1 − δ. Our results are based on a new data structure that maintains a set of elements chosen (almost) uniformly at random from P.
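One common way to draw a random element from a set under ADD/REMOVE operations is to subsample by hash level and keep only a counter and a key sum per level. The sketch below illustrates that general idea under stated assumptions and is not the data structure from the paper: it assumes P is a set (no duplicate ADDs, REMOVE only of present points), uses a simple linear hash that gives no rigorous uniformity guarantee, and the class name `DynamicSampler` is hypothetical.

```python
import random


class DynamicSampler:
    """Recover one point of the current set P after a stream of updates."""

    PRIME = (1 << 61) - 1

    def __init__(self, delta, dim, rng):
        self.delta = delta
        self.dim = dim
        self.levels = (delta ** dim).bit_length() + 1
        self.a = rng.randrange(1, self.PRIME)
        self.b = rng.randrange(self.PRIME)
        self.count = [0] * self.levels    # live points surviving at each level
        self.key_sum = [0] * self.levels  # sum of their integer keys

    def _encode(self, point):
        key = 0
        for x in point:  # coordinates are in {1, ..., delta}
            key = key * self.delta + (x - 1)
        return key

    def _decode(self, key):
        coords = []
        for _ in range(self.dim):
            key, r = divmod(key, self.delta)
            coords.append(r + 1)
        return tuple(reversed(coords))

    def _update(self, point, sign):
        key = self._encode(point)
        h = (self.a * key + self.b) % self.PRIME
        for j in range(self.levels):
            if h % (1 << j) != 0:
                break  # levels are nested, so no higher level keeps the point
            self.count[j] += sign
            self.key_sum[j] += sign * key

    def add(self, point):
        self._update(point, +1)

    def remove(self, point):
        self._update(point, -1)

    def sample(self):
        """Return a point of P from a level holding exactly one, else None."""
        for j in range(self.levels):
            if self.count[j] == 1:
                return self._decode(self.key_sum[j])
        return None
```

Level j keeps roughly a 2^{-j} fraction of P, so some level is likely to hold exactly one point; when that happens the key sum is that point's key. A single instance can fail and return None, so in practice several independent copies would be run side by side.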