2016
DOI: 10.1109/tpds.2015.2475270

Parallel Distributed Breadth First Search on the Kepler Architecture

Abstract: We present the results obtained by using an evolution of our CUDA-based solution for the exploration, via a Breadth First Search, of large graphs. This latest version fully exploits the features of the Kepler architecture and relies on a combination of techniques to reduce both the number of communications among the GPUs and the amount of exchanged data. The final result is a code that can visit more than 800 billion edges per second by using a cluster equipped with 4096 Tesla K20X GPUs.
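The paper's multi-GPU code is not reproduced on this page. As a point of reference, the sketch below shows the core of a single-GPU, level-synchronous BFS step of the kind such codes build on. It is a minimal illustration under stated assumptions, not the authors' implementation: all names (bfs_level, row_off, col_idx, and so on) are hypothetical, and the communication-reducing, Kepler-specific techniques the abstract refers to are omitted.

```cuda
// Minimal sketch of one level of a level-synchronous BFS on a single GPU
// over a CSR graph. Hypothetical names; the paper's Kepler-tuned,
// multi-GPU kernels are far more elaborate.
__global__ void bfs_level(const int *row_off, const int *col_idx,
                          const int *frontier, int frontier_size,
                          int *level, int cur_level,
                          int *next, int *next_size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= frontier_size) return;

    int u = frontier[tid];
    // Expand every neighbor of u; claim unvisited vertices (level == -1)
    // with an atomic compare-and-swap so each is enqueued exactly once.
    for (int e = row_off[u]; e < row_off[u + 1]; ++e) {
        int v = col_idx[e];
        if (atomicCAS(&level[v], -1, cur_level + 1) == -1) {
            int pos = atomicAdd(next_size, 1);
            next[pos] = v;   // v joins the frontier of the next level
        }
    }
}
```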

Cited by 25 publications (18 citation statements) | References 25 publications
“…This is partitioned using METIS and a BFS is performed from s. The aim here is to give an estimate of performance without the approximation error inherent in Equation 10, and 4. Using the actual degree distribution, p_k, the joint degree distribution p_{k,k'} (see note below) and the number of vertices in the peak iteration together with equations (10,11,12) we form a single weighted graph called W_avg. We note that these quantities are computationally inexpensive to calculate and a reasonable estimate may be formed from a small number of BFS runs (here we use 10 runs).…”
Section: Results
confidence: 99%
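The degree statistics this performance model relies on are indeed cheap to collect. As a hedged illustration (hypothetical names, not the cited authors' code), the empirical degree distribution p_k of a CSR graph can be gathered with a simple GPU histogram kernel; the joint distribution p_{k,k'} would add one count per edge, keyed by the degrees of both endpoints.

```cuda
// Sketch: histogram of vertex degrees, from which p_k is obtained by
// normalizing with the vertex count n. hist must be zero-initialized
// and sized max_degree + 1. Hypothetical, illustrative code.
__global__ void degree_hist(const int *row_off, int n, unsigned int *hist)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n) return;
    int deg = row_off[u + 1] - row_off[u];   // CSR row length = degree of u
    atomicAdd(&hist[deg], 1u);
}
```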
“…[17] note that for low-degree vertices the partitioning should be based on vertices, but for high-degree vertices it should be based on edges. In contrast, a 2-D partition [2,8,10,11] distributes the edges of a vertex across several processors. The 2-D approach is based on the observation that an exploration from a set of vertices is equivalent to the product of the adjacency matrix and a vector of the vertices touched.…”
Section: Related Work
confidence: 99%
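The algebraic view mentioned in that quote can be made concrete in a few lines: one BFS expansion is a sparse matrix-vector product over the Boolean (OR, AND) semiring, with the current frontier as the input vector. The kernel below is a hedged single-GPU sketch of that equivalence, with hypothetical names; an actual 2-D code would block-partition the matrix across processors and combine partial results.

```cuda
// One BFS level as y = A^T x over the Boolean semiring:
// y[v] = OR over edges (u,v) of x[u], where x marks the frontier.
// The concurrent stores to y are a benign race: every writer stores 1.
__global__ void bool_spmv(const int *row_off, const int *col_idx,
                          const char *x, char *y, int n)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n || !x[u]) return;             // skip rows outside the frontier
    for (int e = row_off[u]; e < row_off[u + 1]; ++e)
        y[col_idx[e]] = 1;
}
```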
See 1 more Smart Citation
“…In general-purpose CPU and multicore/supercomputing approaches [16,17], Agarwal et al. performed locality optimizations on a quad-socket system to reduce memory traffic [18]. A considerable amount of research on parallel BFS implementations on GPUs focuses on level-synchronous or fixed-point methods [19,20]. The reconfigurable hardware approach in solving graph traversal problems on clusters of FPGAs is limited by graph size and synthesis times [4,8].…”
Section: Related Work
confidence: 99%
“…The final result is a huge improvement in performance, as shown in Figure 8: now, by using 4096 K20x GPUs, we achieve more than 800 GTEPS. Further details can be found in [7].…”
Section: 2-D Graph Decomposition
confidence: 99%