Abstract: We propose Distributed Neighbor Expansion (Distributed NE), a parallel and distributed graph partitioning method that can scale to trillion-edge graphs while providing high partitioning quality. Distributed NE is based on a new heuristic, called parallel expansion, where each partition is constructed in parallel by greedily expanding its edge set from a single vertex in such a way that the increase in vertex cuts is locally minimal. We theoretically prove that the proposed method has the upper bound in …
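The expansion heuristic described in the abstract can be illustrated with a small, sequential sketch. It is not the paper's parallel and distributed algorithm: the function name `expand_partitions`, the seed choice, the tie-breaking rule, and the capacity rule are all assumptions made for illustration, and vertices are assumed to be integers.

```python
from collections import defaultdict


def expand_partitions(edges, num_parts):
    """Toy, single-threaded edge-set expansion (illustrative assumption only).

    Each partition grows from a seed vertex and repeatedly absorbs the
    unassigned edges of the covered vertex that has the fewest of them,
    so each step adds few new vertices to the partition.
    """
    capacity = -(-len(edges) // num_parts)  # ceil(|E| / p)
    unassigned = set(edges)
    incident = defaultdict(set)  # vertex -> edges touching it
    for u, v in edges:
        incident[u].add((u, v))
        incident[v].add((u, v))

    parts = []
    for _ in range(num_parts):
        part, covered = [], set()
        while unassigned and len(part) < capacity:
            # Covered vertices that still have unassigned edges.
            frontier = [x for x in covered if incident[x] & unassigned]
            if frontier:
                x = min(frontier, key=lambda y: (len(incident[y] & unassigned), y))
            else:
                # No frontier yet (or it is exhausted): pick a new seed.
                x = min(unassigned)[0]
            for e in sorted(incident[x] & unassigned):
                if len(part) >= capacity:
                    break
                part.append(e)
                unassigned.discard(e)
                covered.update(e)
        parts.append(part)
    return parts


if __name__ == "__main__":
    # Two triangles joined by a bridge edge.
    demo = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
    for i, part in enumerate(expand_partitions(demo, 2)):
        print(i, part)
```

Every edge ends up in exactly one partition, and no partition exceeds `ceil(|E| / p)` edges. The real method expands all partitions concurrently across machines, which this sketch does not attempt.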
“…NE selects edge gradually to fully fill each partition and this approach performs quite well. M. Hanai et al [14] proposed a follow-up study. They reformed NE to distribute its approach, this proposition can process trillion-edge graphs and achieves better performance in terms of running time.…”
Graph partitioning, a preliminary step of distributed graph processing, has been attracting increasing attention in the last decade. A high-quality graph partitioning algorithm should facilitate graph processing by minimizing the communication overhead and maintaining load balance among distributed computing units. Offline partitioning algorithms usually require knowledge of the complete graph, and are therefore not well suited to handling massive graph-structured data. In contrast, streaming partitioning algorithms take edges or vertices as a stream and make partitioning decisions on the fly. However, the streaming approach faces dilemmas from time to time because of a lack of knowledge. Furthermore, a careless partitioning decision in such a dilemma could significantly decrease the partition quality. In this paper, we propose a novel window-based streaming graph partitioning algorithm (WSGP). WSGP leverages a greedy heuristic to perform edge partitioning. When facing a decision dilemma, WSGP utilizes a size-bounded window to buffer the edges. When the window is full, an edge is popped and assigned to a partition. The assignment is decided by knowledge obtained from both the edges already settled and the ones still cached in the buffer window. Our experiments cover various real-world benchmark graphs. The experimental results demonstrate that WSGP consistently achieves a smaller replication factor than the state-of-the-art algorithms, by up to 23%, at a limited cost in memory and overall running time.
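The window mechanism and the replication factor mentioned above can be sketched as follows. This is a simplified illustration, not WSGP itself: here every edge passes through the window (the abstract buffers only in a dilemma), and the scoring rule, the `max_load` cap, and the names `windowed_greedy` and `replication_factor` are assumptions. Edges are assumed to be distinct, and `num_parts * max_load` must be at least the number of edges.

```python
from collections import defaultdict


def replication_factor(assignment):
    """Average number of partitions in which each vertex appears."""
    where = defaultdict(set)
    for (u, v), p in assignment.items():
        where[u].add(p)
        where[v].add(p)
    return sum(len(s) for s in where.values()) / len(where)


def windowed_greedy(edge_stream, num_parts, window_size, max_load):
    """Toy window-based greedy edge partitioner (illustrative assumption)."""
    replicas = defaultdict(set)  # vertex -> partitions already holding it
    load = [0] * num_parts
    window, assignment = [], {}

    def best(edge):
        # Prefer the partition that already holds the most endpoints of
        # the edge; break ties by lower load. Full partitions are skipped.
        u, v = edge

        def key(p):
            return ((p in replicas[u]) + (p in replicas[v]), -load[p])

        candidates = [p for p in range(num_parts) if load[p] < max_load]
        p = max(candidates, key=key)
        return key(p)[0], p

    def settle():
        # Pop the buffered edge whose best placement reuses most replicas.
        i = max(range(len(window)), key=lambda j: best(window[j])[0])
        edge = window.pop(i)
        _, p = best(edge)
        assignment[edge] = p
        load[p] += 1
        for x in edge:
            replicas[x].add(p)

    for edge in edge_stream:
        window.append(edge)
        if len(window) >= window_size:
            settle()
    while window:  # drain the buffer at end of stream
        settle()
    return assignment
```

A replication factor of 1 means no vertex is replicated across partitions; larger values indicate more vertex cuts and therefore more communication during later graph processing.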
“…The communication cost stems from the vertices/edges spanning computing nodes to ensure the synchronization among all computing nodes. Unfortunately, the graph partition problem with these two constraints is proved to be an NP-hard problem [10], so it is often solved by heuristic methods.…”
Section: Related Work (mentioning)
confidence: 99%
“…SWR [23] resorts the edges in the sliding window to move the edges with low-degree vertices upfront so these edges are less likely to span different computing nodes. Distributed NE [10] selects initial multiple random vertices and then greedily expands each edge set in parallel such that the increase of the vertex cuts becomes minimal, which can allocate most edges in a locally optimal way and seldom uses the random allocation. The locality of real-world graphs also implies many adjacent lists share a lot of common out-neighbors, which is named by target vertices in TSH [24].…”
Graphs are an important model for describing various networks, and their scale keeps growing with the development of communication and information technology. The analysis of large-scale graphs requires distributed graph processing systems, and graph partitioning is the basis of these systems. Almost all existing graph partitioning algorithms are designed for homogeneous clusters and do not consider the differences among computing nodes in heterogeneous clusters. This paper proposes GAP, a Genetic Algorithm based graph Partitioning algorithm, to solve this problem. GAP aims to reduce the total processing time on a heterogeneous cluster by partitioning graphs according to the computing power of each computing node. GAP first produces a balanced partition of the graph, and then uses a genetic algorithm to transfer vertices so as to reduce cut edges. GAP can balance the processing time of computing nodes and reduce the communication time among them. Experiments performed on a heterogeneous cluster demonstrate that GAP outperforms Hash.
“…Existing partitioning algorithms can be divided into two categories: In-memory algorithms [30,44,55,66] and streaming algorithms [28,32,47,51,64]. In-memory algorithms load the complete graph into memory, and, hence, have full flexibility to assign any edge to any partition at any time.…”
Section: Introduction (mentioning)
confidence: 99%
“…Streaming algorithms consume little memory, but even though they have been improved by sophisticated techniques such as window-based streaming [47] and multi-pass streaming [48], they do not yield the same partitioning quality on all graphs as the best in-memory algorithms. In current graph partitioning systems, the user has to decide for one of the two options, and then either provide a very large machine (or a cluster of machines) and get good partitioning quality [30,44,55,66] or a small machine and get worse partitioning quality [28,32,47,51,64].…”
Distributed systems that manage and process graph-structured data internally solve a graph partitioning problem to minimize their communication overhead and query run-time. Besides computational complexity (optimal graph partitioning is NP-hard), another important consideration is the memory overhead. Real-world graphs often have an immense size, such that loading the complete graph into memory for partitioning is not economical or feasible. Currently, the common approach to reducing memory overhead is to rely on streaming partitioning algorithms. While the latest streaming algorithms lead to reasonable partitioning quality on some graphs, they are still not fully competitive with in-memory partitioners. In this paper, we propose a new system, Hybrid Edge Partitioner (HEP), that can partition graphs that fit only partly into memory while yielding a high partitioning quality. HEP can flexibly adapt its memory overhead by separating the edge set of the graph into two sub-sets. One sub-set is partitioned by NE++, a novel, efficient in-memory algorithm, while the other sub-set is partitioned by a streaming approach. Our evaluations on large real-world graphs show that in many cases, HEP outperforms both in-memory partitioning and streaming partitioning at the same time. Hence, HEP is an attractive alternative to existing solutions that cannot fine-tune their memory overheads. Finally, we show that using HEP, we achieve a significant speedup of distributed graph processing jobs on Spark/GraphX compared to state-of-the-art partitioning algorithms.
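The idea of separating the edge set into an in-memory sub-set and a streamed sub-set can be sketched in a few lines. The abstract does not say how the split is made, so the rule used here (stream only the edges whose endpoints both have a degree above `tau` times the mean degree) is an illustrative assumption, as are the names `split_edges` and `tau`.

```python
from collections import Counter


def split_edges(edges, tau):
    """Split edges into (in_memory, streamed) sub-sets.

    Assumed rule for illustration: an edge is streamed only when both of
    its endpoints are high-degree, i.e. degree > tau * mean degree.
    """
    if not edges:
        return [], []
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    mean_degree = 2 * len(edges) / len(degree)
    high = {x for x, d in degree.items() if d > tau * mean_degree}

    in_memory, streamed = [], []
    for u, v in edges:
        (streamed if u in high and v in high else in_memory).append((u, v))
    return in_memory, streamed
```

Under this assumed rule, lowering `tau` moves more edges to the streamed sub-set and so reduces the memory needed for the in-memory part, which mirrors the adjustable memory overhead the abstract describes.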
CCS CONCEPTS: • Information systems → Graph-based database models; • Theory of computation → Graph algorithms analysis.