Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/2063384.2063471

Parallel breadth-first search on distributed memory systems

Abstract: Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional…
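The level-synchronous strategy the abstract describes processes the graph one frontier at a time: all vertices at distance d are expanded before any vertex at distance d+1. A minimal single-threaded sketch of that idea (the names `bfs_levels` and `adj` are illustrative, not from the paper):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: expand the whole frontier for one level
    before starting the next. `adj` maps each vertex to its neighbor list."""
    level = {source: 0}       # discovered vertices and their BFS depth
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:            # expand every frontier vertex
            for v in adj[u]:
                if v not in level:    # first discovery fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier      # barrier: advance to the next level
    return level
```

In the distributed setting of the paper, the per-level barrier becomes a communication phase in which processes exchange newly discovered vertices with their owners.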

Cited by 149 publications (136 citation statements). References 35 publications.
“…CombBLAS curve is mostly flat (only 9% deviation) due to its in-core computational bottlenecks, while SEJITS+KDT and CombBLAS show higher deviations (54% and 62%, respectively) from a perfect flat line. However, these deviations are expected in a large-scale BFS run and are experienced on similar architectures [14].…”
Section: Parallel Scaling
confidence: 81%
“…Both 1D and 2D algorithms can be enhanced by in-node multithreading, resulting in one MPI process per chip instead of one MPI process per core, which reduces the number of communicating parties. Large-scale experiments comparing 1D and 2D show that the 2D approach's communication costs are lower than those of the respective 1D approach, with or without in-node multithreading [6]. The study also shows that in-node multithreading gives a further performance boost by decreasing network contention.…”
Section: Parallel Top-Down BFS
confidence: 86%
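The 1D scheme that the quote contrasts with 2D assigns each process a contiguous block of vertices; each BFS level, every process expands the frontier vertices it owns and ships newly discovered vertices to their owners in an all-to-all exchange. A hedged single-process simulation of that communication pattern (function and variable names are my own; real implementations use MPI buffers, not in-memory buckets):

```python
def owner(v, nverts, nprocs):
    """Block 1D partition: vertex v belongs to process v // ceil(n/p)."""
    block = (nverts + nprocs - 1) // nprocs
    return v // block

def bfs_1d(adj, source, nverts, nprocs):
    """Simulated 1D-partitioned BFS: per level, each 'process' expands its
    local slice of the frontier, then discovered (vertex, parent) pairs are
    delivered to their owning process, which deduplicates them."""
    parent = {source: source}
    frontier = [source]
    while frontier:
        # expansion phase: bucket discoveries by destination process
        outgoing = [[] for _ in range(nprocs)]
        for u in frontier:
            for v in adj.get(u, []):
                outgoing[owner(v, nverts, nprocs)].append((v, u))
        # "all-to-all" phase: owners keep only first-time discoveries
        next_frontier = []
        for bucket in outgoing:
            for v, u in bucket:
                if v not in parent:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent
```

The number of non-empty buckets per level is what in-node multithreading shrinks: with one MPI process per chip rather than per core, fewer parties take part in each exchange.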
“…To yield a fast direction-optimizing BFS implementation, our bottom-up implementation is combined with an existing performant top-down implementation [6]. We provide a parallel complexity analysis of the new algorithm in terms of the bandwidth and synchronization (latency) costs in Section V. Section VI gives details about our direction-optimizing approach that combines top-down and bottom-up steps.…”
Section: Introduction
confidence: 99%
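The direction-optimizing idea referenced above runs top-down while the frontier is small and switches to bottom-up (unvisited vertices search for a parent in the frontier) once the frontier grows large. A shared-memory sketch of the switching logic, assuming an undirected graph; `alpha` is a hypothetical tuning threshold, not a value from the paper:

```python
def bfs_direction_optimizing(adj, source, alpha=4):
    """Hybrid BFS sketch: top-down when the frontier is small relative to
    the unvisited set, bottom-up otherwise. Assumes `adj` is symmetric
    (undirected), so adj[v] doubles as v's in-neighbors."""
    parent = {source: source}
    frontier = {source}
    while frontier:
        unvisited = [v for v in adj if v not in parent]
        nxt = set()
        if len(frontier) * alpha < len(unvisited):
            # top-down step: scan edges leaving the frontier
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.add(v)
        else:
            # bottom-up step: each unvisited vertex probes its neighbors
            # and stops at the first one found in the frontier
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        nxt.add(v)
                        break
        frontier = nxt
    return parent
```

The bottom-up step's early `break` is the source of the savings: on low-diameter graphs with a huge mid-search frontier, most unvisited vertices find a parent after inspecting only a few edges instead of the frontier scanning all of its outgoing edges.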
“…Buluç et al. [5] conducted extensive performance studies of partitioning schemes for BFS on large-scale machines at LBNL, Hopper (6,392 nodes) and Franklin (9,660 nodes), comparing 1-D and 2-D partitioning strategies. Satish et al. [10] proposed an efficient BFS algorithm on commodity supercomputing clusters consisting of Intel CPUs and an InfiniBand network.…”
Section: Related Work
confidence: 99%