Solving path problems on the GPU

Buluç, Aydın; Gilbert, John R.; Budak, Ceren

doi:10.1016/j.parco.2009.12.002

Cited by 88 publications

(61 citation statements)

References 28 publications


“…[9]) has been the choice of several parallel implementation as the algorithm allows one to study cache blocking techniques. Examples of this approach can be seen in Buluc et al [5], Matsumoto et al [28] and Katz et al [23]. The above works report results on a variety of CPU and GPU architectures.…”

Section: Related Work (mentioning)

confidence: 99%


Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems

Dutta, Chaitanya, Kothapalli et al. 2018, IJNC


Graph algorithms play an important role in several fields of sciences and engineering. Prominent among them are the All-Pairs-Shortest-Paths (APSP) and related problems. Indeed there are several efficient implementations for such problems on a variety of modern multi- and manycore architectures. It can be noticed that for several graph problems, parallelism offers only a limited success as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, some of these graphs exhibit clear structural properties due to their sparsity. This calls for particular solution strategies aimed at scalable processing of large, sparse graphs on modern parallel architectures. In this paper, we study the applicability of an ear decomposition of graphs to problems such as all-pairs-shortest-paths and minimum cost cycle basis. Through experimentation, we show that the resulting solutions are scalable in terms of both memory usage and also their speedup over best known current implementations. We believe that our techniques have the potential to be relevant for designing scalable solutions for other computations on large sparse graphs.



“…As graphs corresponding to real-world and practical applications have a massive size, parallel processing is often necessary. It is therefore natural that a lot of current research is directed towards efficient algorithmics on a variety of modern and emerging multi- and many-core architectures [4,28,5,34].…”

Section: Introduction (mentioning)

confidence: 99%

“…Actual algorithms based on this proof are given by various researchers, with minor differences. Our decision to use the DC algorithm as our starting point is inspired by its demonstrated better cache reuse on CPUs [33], and its impressive performance attained on the many-core graphical processor units [11].…”

Section: Previous Work (mentioning)

confidence: 99%

“…SSSP algorithms based on ∆-stepping [32] scale better in practice but their performance is input dependent and scales with O(m+d·L·log n), where d is the maximum vertex degree and L is the maximum shortest path weight from the source. Consequently, it is likely that a Floyd-Warshall based approach would be competitive even for sparse graphs, as realized on graphical processing units [11].…”

Section: Introduction (mentioning)

confidence: 99%

Minimizing Communication in All-Pairs Shortest Paths

Solomonik, Buluç, Demmel 2013


We consider distributed memory algorithms for the all-pairs shortest paths (APSP) problem. Scaling the APSP problem to high concurrencies requires both minimizing inter-processor communication as well as maximizing temporal data locality. The 2.5D APSP algorithm, which is based on the divide-and-conquer paradigm, satisfies both of these requirements: it can utilize any extra available memory to perform asymptotically less communication, and it is rich in semiring matrix multiplications, which have high temporal locality. We start by introducing a block-cyclic 2D (minimal memory) APSP algorithm. With a careful choice of block-size, this algorithm achieves known communication lower-bounds for latency and bandwidth. We extend this 2D block-cyclic algorithm to a 2.5D algorithm, which can use $c$ extra copies of data to reduce the bandwidth cost by a factor of $c^{1/2}$, compared to its 2D counterpart. However, the 2.5D algorithm increases the latency cost by $c^{1/2}$. We provide a tighter lower bound on latency, which dictates that the latency overhead is necessary to reduce bandwidth along the critical path of execution. Our implementation achieves impressive performance and scaling to 24,576 cores of a Cray XE6 supercomputer by utilizing well-tuned intra-node kernels within the distributed memory algorithm.
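
The "semiring matrix multiplications" named in the abstract above can be illustrated with a minimal sketch over the (min, +) semiring, where addition is replaced by `min` and multiplication by `+`. This is not the 2.5D or divide-and-conquer algorithm from the paper; it substitutes plain repeated squaring of the weight matrix, and the function names `min_plus_multiply` and `apsp_by_repeated_squaring` are hypothetical. It assumes a dense weight matrix with `float("inf")` for missing edges and no negative cycles.

```python
INF = float("inf")  # marks "no edge" / "no path found yet"


def min_plus_multiply(a, b):
    """Matrix product over the (min, +) semiring: c[i][j] = min_p (a[i][p] + b[p][j])."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[INF] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            if aip == INF:
                continue  # no path through p, skip the inner loop
            for j in range(m):
                candidate = aip + b[p][j]
                if candidate < c[i][j]:
                    c[i][j] = candidate
    return c


def apsp_by_repeated_squaring(weights):
    """All-pairs shortest paths by squaring the weight matrix (assumes no negative cycles)."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for i in range(n):
        dist[i][i] = min(dist[i][i], 0)  # a vertex reaches itself at cost 0
    hops = 1  # dist currently covers paths of at most `hops` edges
    while hops < n - 1:
        dist = min_plus_multiply(dist, dist)
        hops *= 2
    return dist


# Illustrative 3-vertex graph: 0 -> 1 (cost 3), 1 -> 2 (cost 1), 0 -> 2 (cost 10)
example = [
    [0, 3, 10],
    [INF, 0, 1],
    [INF, INF, 0],
]
shortest = apsp_by_repeated_squaring(example)
```

In this example the direct edge from vertex 0 to vertex 2 costs 10, while the two-edge path through vertex 1 costs 4, so one squaring step is enough to find the shorter route. The triple loop has the same data-access pattern as ordinary matrix multiplication, which is why such kernels can reuse blocking techniques developed for dense linear algebra.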


“…In particular, recent GPU cards produced by NVIDIA Corporation provide substantial benefits for parallel computation, and the company itself supplies an easy-to-implement environment for developers and researchers. Recently, the effectiveness and advantages of using GPUs for technical computations have been widely reported [6]-[8].…”

Section: Introduction (mentioning)

confidence: 99%