Clustering Large Graphs via the Singular Value Decomposition

Drineas, Petros; Frieze, Alan; Kannan, Ravi; Vempala, Santosh; Vinay, V.

doi:10.1023/b:mach.0000033113.59016.96

Cited by 378 publications

(271 citation statements)

References 30 publications

Mentioning: 262


“…This approach is a greedy algorithm that tries to solve the problem of maximizing σ_k for each k. But this problem is known to be NP-hard: even for a given k, maximizing σ_k is the NP-hard "K-Median clustering problem" [10,8] for K = (n − k) clusters. The existing approximation algorithms [10,8] are exponential with the number of clusters to find and unsuitable for our purpose.…”

Section: The Algorithm (mentioning)

confidence: 99%


Computing Communities in Large Networks Using Random Walks

Pons

Latapy

2005

Computer and Information Sciences - ISCIS 2005


Dense subgraphs of sparse graphs (communities), which appear in most real-world complex networks, play an important role in many contexts. Computing them, however, is generally expensive. We propose here a measure of similarities between vertices based on random walks which has several important advantages: it captures well the community structure in a network, it can be computed efficiently, it works at various scales, and it can be used in an agglomerative algorithm to compute efficiently the community structure of a network. We propose such an algorithm which runs in time O(mn^2) and space O(n^2) in the worst case, and in time O(n^2 log n) and space O(n^2) in most real-world cases (n and m are respectively the number of vertices and edges in the input graph). Experimental evaluation shows that our algorithm surpasses previously proposed ones concerning the quality of the obtained community structures and that it stands among the best ones concerning the running time. This is very promising because our algorithm can be improved in several ways, which we sketch at the end of the paper.


“…The existing approximation algorithms [10,8] are exponential with the number of clusters to find and unsuitable for our purpose. So for each pair of adjacent communities {C_1, C_2}, we compute the variation Δσ(C_1, C_2) of σ if we would merge C_1 and C_2 into a new community…”

Section: The Algorithm (mentioning)

confidence: 99%

Computing Communities in Large Networks Using Random Walks

Pons

Latapy

2005

Computer and Information Sciences - ISCIS 2005


“…In the case of "power-law" networks it was shown in [32] that the spectral counting of triangles can be efficient due to their special spectral properties and [33] extended this idea using the randomized algorithm by [12] by proposing a simple biased node sampling. This algorithm can be viewed as a special case of a streaming algorithm, since there exist algorithms, e.g., [29], that perform a constant number of passes over the non-zero elements of the matrix to produce a good low rank matrix approximation.…”

Section: Existing Work (mentioning)

confidence: 99%

Efficient Triangle Counting in Large Graphs via Degree-Based Vertex Partitioning

Kolountzakis

Peng

Tsourakakis

2010

Algorithms and Models for the Web-Graph


Abstract. The number of triangles is a computationally expensive graph statistic which is frequently used in complex network analysis (e.g., transitivity ratio), in various random graph models (e.g., exponential random graph model) and in important real world applications such as spam detection, uncovering of the hidden thematic structure of the Web and link recommendation. Counting triangles in graphs with millions and billions of edges requires algorithms which run fast, use small amount of space, provide accurate estimates of the number of triangles and preferably are parallelizable. In this paper we present an efficient triangle counting algorithm which can be adapted to the semistreaming model [15]. The key idea of our algorithm is to combine the sampling algorithm of [34,35] and the partitioning of the set of vertices into a high degree and a low degree subset respectively as in [2], treating each set appropriately. We obtain a running time O(m + m^{3/2} Δ log n / (t ε^2)) and an ε approximation (multiplicative error), where n is the number of vertices, m the number of edges and Δ the maximum number of triangles an edge is contained. Furthermore, we show how this algorithm can be adapted to the semistreaming model with space usage O(m^{1/2} log n + m^{3/2} Δ log n / (t ε^2)) and a constant number of passes (three) over the graph stream. We apply our methods in various networks with several millions of edges and we obtain excellent results. Finally, we propose a random projection based method for triangle counting and provide a sufficient condition to obtain an estimate with low variance.
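The citation statement for this paper mentions spectral counting of triangles. As background, that idea rests on a standard identity: for an undirected simple graph with adjacency matrix A, trace(A^3) equals the sum of the cubed eigenvalues of A, and both equal six times the number of triangles. The sketch below is not the algorithm of any paper listed here; it only checks the identity's left-hand side exactly in plain Python. The function name `count_triangles` and the list-of-lists input format are assumptions made for illustration.

```python
def count_triangles(adj):
    """Count triangles in an undirected simple graph.

    adj is a symmetric 0/1 adjacency matrix (list of lists) with a zero
    diagonal. Each triangle contributes 6 closed walks of length 3, so the
    count is trace(A^3) / 6.
    """
    n = len(adj)
    # a2[i][j] = number of walks of length 2 from i to j, i.e. (A^2)[i][j].
    a2 = [
        [sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    # trace(A^3) = sum over i, j of (A^2)[i][j] * A[j][i].
    trace_a3 = sum(a2[i][j] * adj[j][i] for i in range(n) for j in range(n))
    return trace_a3 // 6


# Complete graph on 4 vertices: every 3-subset of vertices forms a triangle.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(count_triangles(k4))
```

Spectral methods replace the exact trace with a sum over a few eigenvalues of largest magnitude, which trades accuracy for speed on large graphs; this exact version takes cubic time and is only practical for small inputs.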


“…It should be noted, however, that there are of course other options, and that alternative measures can indeed be found in literature. Anyway, as a reasonable feature of (11), note that it is a normalized measure between 0 and 1, where the latter value is assumed for perfectly identical structures. This property is often violated for fuzzifications of standard (relative) evaluation measures such as, e.g., those based on the comparison of coincidence matrices.…”

Section: Similarity Between Cluster Models (mentioning)

confidence: 99%

Fuzzy Clustering of Parallel Data Streams

Beringer

Hüllermeier

2007

Advances in Fuzzy Clustering and Its Applications


The management and processing of so-called data streams has recently become a topic of active research in several fields of computer science, notably database systems and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is to say, continuously evolving time series. More specifically, we are interested in grouping data streams the evolution over time of which is similar in a specific sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming data in an online manner, tolerating not more than a constant time delay. For this purpose, we develop an efficient online version of the fuzzy C-means clustering algorithm. A fuzzy approach appears to be particularly useful for this type of application, in which the clustering structure is subject to continuous changes.
