Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2

Bader, David A.; Madduri, Kamesh

doi:10.1109/icpp.2006.34

Cited by 149 publications

(115 citation statements)

References 35 publications

Citation types: Supporting, Mentioning (113), Contrasting


“…The graphs also shows an important property of the new Nehalem processors: we can hide the memory latency by keeping a number of read requests in flight, as traditionally done by multi-threaded architectures [16], [15]. Surprisingly, with a simple software pipelining strategy we can increase by a factor of eight the number of transactions per second: for example, with a working set of 8MB, the memory subsystem can satisfy up to 160 millions reads per second, and with 2 GB we can achieve 40 millions of random reads per second.…”

Section: System Architecture and Experimental Platforms (mentioning)

confidence: 99%


Scalable Graph Exploration on Multicore Processors

Agarwal, Petrini, Pasetto, et al. 2010

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis


Abstract: Many important problems in computational sciences, social network analysis, security, and business analytics, are data-intensive and lend themselves to graph-theoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadth-first search (BFS) algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems. Our new methodology for large-scale graph analytics combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. We present an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we prove that our graph exploration algorithm running on a 4-socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 millions edges, (2) capable of processing 550 million edges per second with an R-MAT graph with 200 million vertices and 1 billion edges, comparable to the performance of a similar graph on a Cray MTA-2 with 40 processors and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.


“…A good amount of literature deals with the design of BFS solutions, either based on commodity processors [11], [12] or special purpose hardware [13], [14], [15], [16]. Some recent publications describe successful parallelization strategies of list ranking [17] and phylogenetic trees on the Cell BE [18].…”

Section: Introduction (mentioning)

confidence: 99%

Scalable Graph Exploration on Multicore Processors

Agarwal, Petrini, Pasetto, et al. 2010

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis


“…On massively multithreaded systems, Bader and Madduri [23] introduce a fine-grained implementation on the Cray MTA-2 system using the level synchronous approach, achieving good scaling on the 40 processor MTA-2. Mizell and Maschhoff [24] improve and port this algorithm to the Cray XMT, the successor to the MTA-2.…”

Section: Related Work (mentioning)

confidence: 99%

Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search

Beamer, Buluç, Asanović, et al. 2013

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum


Abstract: Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedups compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
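The switch between top-down and bottom-up steps described in this abstract can be illustrated with a small sequential sketch. This is not the authors' distributed-memory code: the function name `direction_optimizing_bfs`, the adjacency-dict input, and the switching rule (go bottom-up when the frontier's edge count times a tunable `alpha` exceeds the edge count of the unvisited vertices) are assumptions chosen for illustration, and the sketch recomputes the unvisited set at every level for clarity rather than speed.

```python
def direction_optimizing_bfs(adj, source, alpha=14):
    """Sequential sketch of a direction-optimizing BFS.

    adj    : dict mapping each vertex to a list of neighbours (undirected graph)
    source : start vertex
    alpha  : hypothetical tuning knob; larger values favour bottom-up steps
    Returns a dict mapping each reached vertex to its BFS parent.
    """
    parent = {source: source}
    frontier = {source}
    while frontier:
        # Rough cost of a top-down step: edges leaving the frontier.
        frontier_edges = sum(len(adj[v]) for v in frontier)
        # Rough cost of a bottom-up step: edges of still-unvisited vertices.
        unvisited = [v for v in adj if v not in parent]
        unvisited_edges = sum(len(adj[v]) for v in unvisited)

        next_frontier = set()
        if frontier_edges * alpha > unvisited_edges:
            # Bottom-up: each unvisited vertex looks for any parent in the
            # frontier and stops at the first one it finds.
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        next_frontier.add(v)
                        break
        else:
            # Top-down: each frontier vertex claims its unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent
```

Setting `alpha=0` forces every step to be top-down, which gives a baseline for comparing the two directions on the same graph.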


“…GPU implementation of FW for smaller graphs is given in [8] and for larger graphs shared memory and cache efficient GPU implementations for APSP using FW are given in [16] [9]. To further enhance the performance some optimization techniques like tiling, loop unrolling and SIMD vectorization can be used.…”

Section: Problem Time Complexity (mentioning)

confidence: 99%

OpenCL Parallel Blocked Approach for Solving All Pairs Shortest Path Problem on GPU

Pandey, Sharma 2015

IJCA


All-Pairs Shortest Path Problem (APSP) finds a large number of practical applications in real world. This paper presents a blocked parallel approach for APSP using an open standard framework OpenCL, which provides development environment for utilizing heterogeneous computing elements of computer system and to take advantage of massive parallel capabilities of multi-core processors such as graphics processing unit (GPU) and CPU. This blocked parallel approach exploits the local shared memory of GPU, thereby enhancing the overall performance. The proposed solution is for directed and dense graphs with no negative cycles and is based on blocked Floyd Warshall (FW) and Kleene's algorithm. Like Floyd Warshall this approach is also in-place and therefore requires no extra memory.

