A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

Yoo, Andy; Chow, Edmond; Henderson, Keith; McLendon, William Clarence; Hendrickson, Bruce; Çatalyürek, Ümit V.

doi:10.1109/sc.2005.4

Cited by 188 publications

(159 citation statements)

References 19 publications

Supporting

Mentioning

158

Contrasting

Unclassified


“…Local discovery (per substep) (lines 11-20) Search for parents with the information available locally.…”

Section: Parallel and Distributed BFS Algorithm (mentioning)

confidence: 99%

“…Yoo [16] improves on this by employing block-cyclic distribution, eliminating the need for transpose vector at the cost of added code complexity. We adapt Yoo's method so that it becomes applicable to hybrid BFS (Fig.…”

Section: Reducing Communication With Better Partitioning (mentioning)

confidence: 99%
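
To make the term "block-cyclic distribution" in the statement above concrete, here is a minimal sketch of the generic block-cyclic index-to-owner mapping. It is not the exact layout used by Yoo et al. or by the citing paper; the function names, the block size, and the processor grid shape are all illustrative assumptions.

```python
def block_cyclic_owner(index, block_size, num_procs):
    """Owner of a global index under a 1-D block-cyclic distribution.

    Indices are grouped into consecutive blocks of `block_size`, and the
    blocks are dealt out to processes in round-robin order.
    """
    return (index // block_size) % num_procs


def edge_owner(u, v, block_size, grid_rows, grid_cols):
    """Hypothetical 2-D mapping: the adjacency-matrix entry (u, v) goes to
    the process at (row, col) of a grid_rows x grid_cols processor grid."""
    return (
        block_cyclic_owner(u, block_size, grid_rows),
        block_cyclic_owner(v, block_size, grid_cols),
    )
```

With a block size of 2 on a 3 x 2 grid, for example, `edge_owner(5, 6, 2, 3, 2)` places the edge (5, 6) on the process in grid row 2, grid column 1.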

“…As we mentioned, Yoo [16] proposed an effective method for 2-D graph partitioning for BFS in a large-scale distributed-memory computing environment; the base algorithm itself was a simple top-down BFS and was evaluated on a large-scale environment 32,768 node BlueGene/L.…”

Section: Related Work (mentioning)

confidence: 99%
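
For readers unfamiliar with the "simple top-down BFS" that the statement above refers to, the following is a minimal single-process sketch of a level-synchronous top-down traversal. It illustrates only the base algorithm, not the distributed 2-D partitioned version evaluated on BlueGene/L; the dictionary-based adjacency representation and the function name are assumptions made for this example.

```python
def top_down_bfs(adjacency, source):
    """Level-synchronous top-down BFS.

    `adjacency` maps each vertex to an iterable of its neighbours.
    Returns a dict mapping every reachable vertex to its BFS level.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Top-down step: every frontier vertex scans its outgoing edges
        # and claims any neighbour that has not been visited yet.
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

In a distributed setting, the inner loop is where the frontier's neighbours would be exchanged between processes, which is why the choice of partitioning affects communication volume.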


Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines

et al. 2017


“…There have been numerous implementations of parallel graph algorithms using various computer architectures, including distributed memory supercomputers [36], shared memory supercomputers [4], and multi-core SMP machines [21]. In the context of points-to analyses, the only parallel implementation we know of [25] has been discussed in depth in previous sections.…”

Section: Related Work (mentioning)

confidence: 99%

A GPU implementation of inclusion-based points-to analysis

Méndez-Lojo, Burtscher, Pingali 2012

Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components.

In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.

Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.


“…LLNL first demonstrated breadth-first search of a 3 × 10^9-node graph on the IBM BlueGene/L, the world's fastest supercomputer [7]. A random graph of this size is the largest that can fit in the machine's 32,768-node memory. Subsequently, LLNL processed a 10^10-node scale-free graph using a very different approach and architecture.…”

Section: I/O-intensive Sparse Graph Analysis (mentioning)

confidence: 99%

Hardware Technologies for High-Performance Data-Intensive Computing

et al. 2008


As the amount of scientific and social data continues to grow, researchers in a multitude of domains face challenges associated with storing, indexing, retrieving, assimilating, and synthesizing raw data into actionable information. Combining techniques from computer science, statistics, and applied math, data-intensive computing involves developing and optimizing algorithms and systems that interact closely with large volumes of data.

Scientific applications that read and write large data sets often perform poorly and don't scale well on present-day computing systems. Many data-intensive applications are data-path-oriented, making little use of branch prediction and speculation hardware in the CPU. These applications are well suited to streaming data access and can't effectively use the sophisticated on-chip cache hierarchy. Their ability to process large data sets is hampered by orders-of-magnitude mismatches between disk, memory, and CPU bandwidths.

Emerging technologies can improve data-intensive algorithms' performance, at reasonable cost in development time, by an order of magnitude over the state of the art. Coprocessors such as graphics processor units (GPUs) and field-programmable gate arrays (FPGAs) can significantly speed up some application classes in which data-path-oriented computing is dominant. Additionally, these coprocessors interact with application-controlled on-chip memory rather than a traditional cache.

To alleviate the 10-to-100 factor mismatch in bandwidth between disk and memory, we investigated an I/O system built from a large, parallel array of solid-state storage devices. While containing the same NAND flash chips as USB drives, such I/O arrays achieve significantly higher bandwidth and lower latency than USB drives through parallel access to an array of devices.

To quantify these technologies' merits, we've created a small collection of data-intensive benchmarks selected from applications in data analysis and science. These benchmarks draw from three data types: scientific imagery, unstructured text, and semantic graphs representing networks of relationships. Our results demonstrate that augmenting commodity processors to exploit these technologies can improve performance 2 to 17 times.

COPROCESSORS

Coprocessors designed for data-oriented computing can deliver orders-of-magnitude better performance than general-purpose microprocessors on data-path-centric compute kernels. We evaluated the benefits of two coprocessor architectures: graphics processors and reconfigurable hardware.

Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Experiments with three benchmarks suggest that emerging hardware technologies can significantly boost performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.

