We consider Maximal Clique Enumeration (MCE) from a large graph. A maximal clique is perhaps the most fundamental dense substructure in a graph, and MCE is an important tool to discover densely connected subgraphs, with numerous applications to data mining on web graphs, social networks, and biological networks. While effective sequential methods for MCE are known, scalable parallel methods for MCE are still lacking. We present a new parallel algorithm for MCE, Parallel Enumeration of Cliques using Ordering (PECO), designed for the MapReduce framework. Unlike previous works, which required a post-processing step to remove duplicate and non-maximal cliques, PECO enumerates only maximal cliques with no duplicates. The key technical ingredient is a total ordering of the vertices of the graph, which is used in a novel way to achieve a load-balanced distribution of work and to eliminate redundant work among processors.
We implemented PECO on Hadoop MapReduce, and our experiments on a cluster show that the algorithm can effectively process a variety of large real-world graphs with millions of vertices and tens of millions of maximal cliques, and scales well with the degree of available parallelism.
Keywords: Graph mining, Maximal clique enumeration, Enumeration algorithm, MapReduce, Hadoop, Parallel algorithm, Clique, Load balancing
Disciplines: Electrical and Computer Engineering
Comments: This is a manuscript of an article from Svendsen, Michael, Arko Provo Mukherjee, and Srikanta Tirthapura. "Mining maximal cliques from a large graph using mapreduce: Tackling highly uneven subproblem sizes."
Highlights:
• Scalable method for enumerating maximal cliques in a graph using MapReduce.
• Effective solution to load balancing.
• Experimental evaluation of our solution on large real world graphs.
• Outperforms previous MapReduce solutions by orders of magnitude.
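The role of the vertex ordering in the PECO abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the degree-based ordering, the Bron-Kerbosch routine with pivoting, and the names `build_adjacency`, `bron_kerbosch`, and `cliques_by_ordering` are all assumptions made for illustration. Each vertex owns one independent task, and a maximal clique is reported only by the task of its lowest-ranked vertex, so no duplicate or non-maximal clique has to be filtered afterwards.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def bron_kerbosch(adj, r, p, x, out):
    """Bron-Kerbosch with pivoting: r = current clique, p = candidates, x = excluded."""
    if not p and not x:
        out.append(frozenset(r))
        return
    # Pick the pivot with the most neighbours among the candidates.
    pivot = max(p | x, key=lambda w: len(adj[w] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}
        x = x | {v}


def cliques_by_ordering(edges):
    """Enumerate maximal cliques, one independent task per vertex."""
    adj = build_adjacency(edges)
    # Assumed total order: by degree, ties broken by vertex id.
    ordered = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(ordered)}
    result = []
    for v in adj:  # in MapReduce, each iteration would be a separate reducer task
        later = {w for w in adj[v] if rank[w] > rank[v]}
        earlier = adj[v] - later
        # Earlier neighbours start in the excluded set, so cliques that
        # belong to a lower-ranked vertex are never reported here.
        bron_kerbosch(adj, {v}, later, earlier, result)
    return result
```

Calling `cliques_by_ordering` on an edge list returns each maximal clique as a `frozenset`. Ranking low-degree vertices first tends to give high-degree vertices smaller candidate sets, which is one plausible way to even out task sizes.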
We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on single-dimensional streams, and such methods yield little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as what other items appear frequently along with a heavy-hitter, or what the frequency distribution is of items that appear along with the heavy-hitters. We consider queries of the following form: "In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H". We call this problem Correlated Heavy-Hitters (CHH). We give an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace that is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
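The CHH query above can be approximated in small space with nested counter summaries. The sketch below is an assumption-laden illustration, not the algorithm from the paper: it nests one Misra-Gries summary over y values inside each counter of an outer Misra-Gries-style summary over x values, and the class names, the parameters `k_outer` and `k_inner`, and the `report` thresholds are all hypothetical. As a simplification, inner summaries are left untouched when outer counters are decremented.

```python
class MisraGries:
    """Misra-Gries frequent-items summary holding at most k counters."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Summary is full: decrement every counter and drop zeros.
            for key in list(self.counts):
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]


class CorrelatedHeavyHitters:
    """Outer summary over x; each tracked x keeps an inner summary over y."""

    def __init__(self, k_outer, k_inner):
        self.k_outer = k_outer
        self.k_inner = k_inner
        self.outer = {}  # x -> [estimated count of x, MisraGries over y]

    def add(self, x, y):
        if x in self.outer:
            entry = self.outer[x]
            entry[0] += 1
            entry[1].add(y)
        elif len(self.outer) < self.k_outer:
            inner = MisraGries(self.k_inner)
            inner.add(y)
            self.outer[x] = [1, inner]
        else:
            # Outer summary is full: decrement every x counter; an evicted x
            # loses its inner summary as well.
            for key in list(self.outer):
                self.outer[key][0] -= 1
                if self.outer[key][0] == 0:
                    del self.outer[key]

    def report(self, min_x, min_y):
        """Return {x: [y, ...]} for estimated counts at or above the thresholds."""
        result = {}
        for x, (count, inner) in self.outer.items():
            if count >= min_x:
                result[x] = sorted(
                    y for y, c in inner.counts.items() if c >= min_y
                )
        return result
```

The workspace is at most `k_outer * k_inner` counters regardless of stream length. The stored counts never exceed the true counts, so the thresholds passed to `report` act on underestimates.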
We consider the enumeration of dense substructures (maximal cliques) from an uncertain graph. For parameter 0 < α < 1, we define the notion of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within an uncertain graph. We present an algorithm to enumerate α-maximal cliques whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm.
Keywords: Runtime, Proteins, Algorithm design and analysis, Social network services, Uncertainty, Bioinformatics
Disciplines: Electrical and Computer Engineering
Comments: This is a manuscript of an article published as Mukherjee, Arko Provo, Pan Xu, and Srikanta Tirthapura.
We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many data mining problems arising in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce framework, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through a task assignment that is based on an appropriate total order among the vertices. We show theoretically that our algorithm is work-optimal, i.e., it performs the same total work as its sequential counterpart. We present a detailed evaluation which shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale.
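The cluster-then-enumerate idea in the biclique abstract can be sketched on a single machine as follows. This is an illustrative simplification, not the authors' algorithm: it replaces their sequential enumeration and pruning with a brute-force closure over subsets of each vertex's neighbourhood (exponential in the degree), and the degree-based order and the names `common_neighbors` and `maximal_bicliques` are assumptions. The ordering is used here only to decide which vertex's task reports a given biclique, so each one is emitted exactly once.

```python
from itertools import combinations


def common_neighbors(adj, vertices):
    """Vertices adjacent to every member of a non-empty vertex collection."""
    it = iter(vertices)
    result = set(adj[next(it)])
    for u in it:
        result &= adj[u]
    return result


def maximal_bicliques(adj):
    """Enumerate maximal bicliques (left, right), one task per vertex.

    adj maps every vertex to the set of its neighbours (no self-loops).
    """
    # Assumed total order: by degree, ties broken by vertex id.
    ordered = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(ordered)}
    output = []
    for v in adj:  # each iteration stands in for one reducer's cluster
        seen = set()
        nbrs = sorted(adj[v])
        for size in range(1, len(nbrs) + 1):
            for subset in combinations(nbrs, size):
                # Closure: left = vertices adjacent to all of subset (includes v),
                # right = vertices adjacent to all of left (includes subset).
                left = frozenset(common_neighbors(adj, subset))
                right = frozenset(common_neighbors(adj, left))
                if (left, right) in seen:
                    continue
                seen.add((left, right))
                # Only the lowest-ranked member's task reports the biclique.
                if min(left | right, key=rank.get) == v:
                    output.append((left, right))
    return output
```

Every maximal biclique containing a vertex lies within that vertex's 2-neighbourhood, which is why a task needs only a local cluster of the graph rather than the whole input.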
We consider mining dense substructures (maximal cliques) from an uncertain graph, which is a probability distribution on a set of deterministic graphs. For parameter 0 < α < 1, we consider the notion of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within an uncertain graph. We present an algorithm to enumerate α-maximal cliques whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm.
Keywords: Runtime, Algorithm design and analysis, Proteins, Social network services, Communities, Data mining, Moon
Disciplines: Electrical and Computer Engineering
Comments: This is a manuscript of a proceeding published as Mukherjee, Arko Provo, Pan Xu, and Srikanta Tirthapura.
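The abstract does not spell out the definition of an α-maximal clique, so the following sketch rests on stated assumptions: edges exist independently, the probability that a vertex set is a clique is the product of its edge probabilities, and a set is α-maximal if that probability is at least α and no single added vertex keeps it at or above α. The function names and the brute-force enumeration are illustrative only and bear no relation to the near-optimal algorithm mentioned above.

```python
from itertools import combinations


def clique_probability(prob, vertices):
    """Probability that every pair in `vertices` is an edge.

    prob maps (u, v) with u < v to an edge probability; missing pairs count as 0.
    Edges are assumed to be independent.
    """
    p = 1.0
    for u, v in combinations(sorted(vertices), 2):
        p *= prob.get((u, v), 0.0)
    return p


def is_alpha_maximal(prob, all_vertices, c, alpha):
    """True if c is a clique with probability >= alpha and cannot be extended."""
    if clique_probability(prob, c) < alpha:
        return False
    # Adding vertices never raises the probability, so checking
    # single-vertex extensions is enough to establish maximality.
    return all(
        clique_probability(prob, c | {v}) < alpha for v in all_vertices - c
    )


def alpha_maximal_cliques(prob, all_vertices, alpha):
    """Brute-force enumeration over all vertex subsets (small graphs only)."""
    found = []
    for size in range(1, len(all_vertices) + 1):
        for combo in combinations(sorted(all_vertices), size):
            c = frozenset(combo)
            if is_alpha_maximal(prob, all_vertices, c, alpha):
                found.append(c)
    return found
```

On a triangle with edge probabilities 0.9, 0.8, and 0.6, the full triangle has probability 0.432, so under these assumptions it qualifies for α = 0.3 but splits into its three edges for α = 0.5.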