2016
DOI: 10.1109/tpds.2015.2475270

Parallel Distributed Breadth First Search on the Kepler Architecture

Abstract: We present the results obtained by using an evolution of our CUDA-based solution for the exploration, via a Breadth First Search, of large graphs. This latest version fully exploits the features of the Kepler architecture and relies on a combination of techniques to reduce both the number of communications among the GPUs and the amount of exchanged data. The final result is a code that can visit more than 800 billion edges per second by using a cluster equipped with 4096 Tesla K20X GPUs.
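The paper's multi-GPU code is not reproduced on this page. As a point of reference, the sketch below shows the core of a single-GPU, level-synchronous BFS step of the kind such codes build on. It is a minimal illustration under stated assumptions, not the authors' implementation: all names (bfs_level, row_off, col_idx, and so on) are hypothetical, and the communication-reducing, Kepler-specific techniques the abstract refers to are omitted.

```cuda
// Minimal sketch of one level of a level-synchronous BFS on a single GPU
// over a CSR graph. Hypothetical names; the paper's Kepler-tuned,
// multi-GPU kernels are far more elaborate.
__global__ void bfs_level(const int *row_off, const int *col_idx,
                          const int *frontier, int frontier_size,
                          int *level, int cur_level,
                          int *next, int *next_size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= frontier_size) return;

    int u = frontier[tid];
    // Expand every neighbor of u; claim unvisited vertices (level == -1)
    // with an atomic compare-and-swap so each is enqueued exactly once.
    for (int e = row_off[u]; e < row_off[u + 1]; ++e) {
        int v = col_idx[e];
        if (atomicCAS(&level[v], -1, cur_level + 1) == -1) {
            int pos = atomicAdd(next_size, 1);
            next[pos] = v;   // v joins the frontier of the next level
        }
    }
}
```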

Cited by 25 publications (18 citation statements) | References 25 publications
“…This is partitioned using METIS and a BFS is performed from s. The aim here is to give an estimate of performance without the approximation error inherent in Equation 10, and 4. Using the actual degree distribution, p_k, the joint degree distribution p_{k,k'} (see note below) and the number of vertices in the peak iteration together with equations (10,11,12) we form a single weighted graph called W_avg. We note that these quantities are computationally inexpensive to calculate and a reasonable estimate may be formed from a small number of BFS runs (here we use 10 runs).…”
Section: Results
confidence: 99%
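The degree statistics this performance model relies on are indeed cheap to collect. As a hedged illustration (hypothetical names, not the cited authors' code), the empirical degree distribution p_k of a CSR graph can be gathered with a simple GPU histogram kernel; the joint distribution p_{k,k'} would add one count per edge, keyed by the degrees of both endpoints.

```cuda
// Sketch: histogram of vertex degrees, from which p_k is obtained by
// normalizing with the vertex count n. hist must be zero-initialized
// and sized max_degree + 1. Hypothetical, illustrative code.
__global__ void degree_hist(const int *row_off, int n, unsigned int *hist)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n) return;
    int deg = row_off[u + 1] - row_off[u];   // CSR row length = degree of u
    atomicAdd(&hist[deg], 1u);
}
```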
“…[17] note that for low-degree vertices the partitioning should be based on vertices, but for high-degree vertices it should be based on edges. In contrast, a 2-D partition [2,8,10,11] distributes the edges of a vertex across several processors. The 2-D approach is based on the observation that an exploration from a set of vertices is equivalent to the product of the adjacency matrix and a vector of the vertices touched.…”
Section: Related Work
confidence: 99%
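The algebraic view mentioned in that quote can be made concrete in a few lines: one BFS expansion is a sparse matrix-vector product over the Boolean (OR, AND) semiring, with the current frontier as the input vector. The kernel below is a hedged single-GPU sketch of that equivalence, with hypothetical names; an actual 2-D code would block-partition the matrix across processors and combine partial results.

```cuda
// One BFS level as y = A^T x over the Boolean semiring:
// y[v] = OR over edges (u,v) of x[u], where x marks the frontier.
// The concurrent stores to y are a benign race: every writer stores 1.
__global__ void bool_spmv(const int *row_off, const int *col_idx,
                          const char *x, char *y, int n)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n || !x[u]) return;             // skip rows outside the frontier
    for (int e = row_off[u]; e < row_off[u + 1]; ++e)
        y[col_idx[e]] = 1;
}
```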
See 1 more Smart Citation
“…In general-purpose CPU and multicore/supercomputing approaches [16,17], Agarwal et al. performed locality optimizations on a quad-socket system to reduce memory traffic [18]. A considerable amount of research on parallel BFS implementations on GPUs focuses on level-synchronous or fixed-point methods [19,20]. The reconfigurable hardware approach in solving graph traversal problems on clusters of FPGAs is limited by graph size and synthesis times [4,8].…”
Section: Related Work
confidence: 99%
“…The final result is a huge improvement in performance, as shown in Figure 8: now, by using 4096 K20x GPUs, we achieve more than 800 GTEPS. Further details can be found in [7].…”
Section: 2-D Graph Decomposition
confidence: 99%