Abstract: Large-scale graph processing is a challenging systems problem because graph access patterns lack locality. This paper presents PathGraph, a system for improving iterative graph computation on graphs with billions of edges. Our system design has three distinguishing features. First, we model a large graph as a collection of tree-based partitions and use path-centric computation rather than vertex-centric or edge-centric computation. Our path-centric graph parallel computation model significantly improves memory and disk locality for iterative computation algorithms on large graphs. Second, we design a compact storage structure optimized for iterative graph parallel computation. Concretely, we use delta compression, partition a large graph into tree-based partitions, and store the trees in DFS order. By clustering highly correlated paths together, we further maximize sequential access and minimize random access on storage media. Third, we implement the path-centric computation model using a scatter/gather programming model, which parallelizes the iterative computation at the level of partition trees and performs sequential local updates for the vertices in each tree partition to improve convergence speed. We compare PathGraph with recent alternative graph processing systems such as GraphChi and X-Stream, and show that the path-centric approach outperforms vertex-centric and edge-centric systems on a number of graph algorithms for both in-memory and out-of-core graphs.
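To make the storage idea concrete, the following is a minimal sketch of the two ingredients the abstract names: laying out a tree partition in DFS order and delta-compressing a sorted list of vertex IDs. It is an illustration under stated assumptions, not PathGraph's actual storage format; the function names `dfs_order`, `delta_encode`, and `delta_decode`, and the dictionary-based tree representation, are hypothetical.

```python
from typing import Dict, List


def dfs_order(tree: Dict[int, List[int]], root: int) -> List[int]:
    """Return the vertices of a tree partition in DFS (pre-order) order.

    `tree` maps a vertex to its list of children (hypothetical layout).
    """
    order: List[int] = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(tree.get(v, [])))
    return order


def delta_encode(sorted_ids: List[int]) -> List[int]:
    """Keep the first ID as-is, then store only the gap to the previous ID."""
    deltas: List[int] = []
    prev = 0
    for x in sorted_ids:
        deltas.append(x - prev)
        prev = x
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by taking a running sum of the gaps."""
    ids: List[int] = []
    total = 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids


if __name__ == "__main__":
    # A small tree partition rooted at vertex 0 (made-up example data).
    tree = {0: [1, 4], 1: [2, 3], 4: [5]}
    print(dfs_order(tree, 0))

    neighbours = [100, 101, 103, 110]
    encoded = delta_encode(neighbours)
    print(encoded)
    print(delta_decode(encoded) == neighbours)
```

When vertices that lie on the same paths receive nearby IDs, the gaps stay small, so a variable-length integer encoding of the deltas (not shown here) would need fewer bytes than the raw IDs.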
As RDF data continues to gain popularity, both the number of RDF repositories and the size of RDF datasets are growing fast. Many known RDF datasets contain billions of RDF triples (subject, predicate, and object). One of the grand challenges in managing this volume of RDF data is how to execute RDF queries efficiently. In this paper, we address the query processing problems posed by billion-triple datasets. We first identify some causes of the problems in existing query optimization schemes, such as large intermediate results and errors in initial query cost estimation. We then present our block-oriented dynamic query plan generation approach, powered by pipelined execution. Our approach consists of two phases. In the first phase, a near-optimal execution plan is chosen by identifying the processing blocks of a query. We group the join patterns that share a join variable into building blocks of the query plan, since executing them first provides opportunities to reduce the size of the intermediate results generated. In the second phase, we further optimize the initial pipelining for a given query plan. We employ optimization techniques such as sideways information passing and semijoins to further reduce the size of intermediate results, improve query cost estimation, and speed up query execution. Experimental results on several RDF datasets of over a billion triples demonstrate that our approach outperforms existing RDF query engines that rely on static query processing strategies based on dynamic programming.
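The first phase, grouping join patterns that share a join variable into blocks, can be sketched roughly as follows. This is a simple greedy illustration, assuming triple patterns are 3-tuples of strings in which variables start with `?`; the abstract does not specify the actual block identification algorithm, and the names `variables` and `group_into_blocks` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def variables(pattern: Triple) -> List[str]:
    """Return the terms of a triple pattern that are variables."""
    return [term for term in pattern if term.startswith("?")]


def group_into_blocks(patterns: List[Triple]) -> List[Tuple[str, List[Triple]]]:
    """Greedily group patterns by the variable shared by the most patterns.

    Assumption: a greedy choice stands in for the paper's unspecified method.
    """
    remaining = list(patterns)
    blocks: List[Tuple[str, List[Triple]]] = []
    while remaining:
        by_var: Dict[str, List[Triple]] = defaultdict(list)
        for p in remaining:
            for v in set(variables(p)):
                by_var[v].append(p)
        if not by_var:
            # Patterns without variables form a final block of their own.
            blocks.append(("", remaining))
            break
        # Sorting first makes ties deterministic (max keeps the first maximum).
        var = max(sorted(by_var), key=lambda v: len(by_var[v]))
        block = by_var[var]
        blocks.append((var, block))
        remaining = [p for p in remaining if p not in block]
    return blocks


if __name__ == "__main__":
    # Made-up query patterns for illustration only.
    query = [
        ("?x", "type", "Student"),
        ("?x", "takes", "?c"),
        ("?c", "taughtBy", "?p"),
        ("?p", "worksFor", "MIT"),
        ("?x", "name", "?n"),
    ]
    for var, block in group_into_blocks(query):
        print(var, block)
```

In this example the three patterns on `?x` form the first block and the two patterns on `?p` form the second. Evaluating a block together lets the shared variable filter candidate bindings early, which is the intuition behind reducing intermediate result sizes.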