Multi-GPU Graph Analytics

Pan, Yuechao; Wang, Yangzihao; Wu, Yuduo; Yang, Carl; Owens, John D.

doi:10.1109/ipdps.2017.117

Cited by 47 publications

(39 citation statements)

References 28 publications


“…Another limitation of existing systems is that they are integrated solutions that come with their own programming models, runtime systems, and communication runtimes, which makes it difficult to reuse infrastructure to build new systems. For example, all existing GPU graph analytics systems such as Gunrock [56,69], Groute [8], and IrGL [55] are limited to a single node, and there is no way to reuse infrastructure from existing distributed graph analytics systems to build GPU-based distributed graph analytics systems from these single-node systems.…”

Section: Introduction (mentioning)

confidence: 99%

Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics

et al. 2018


This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon's ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon's ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for …



“…We compare our results with previous efforts in Table II. When compared against single-node multi-GPU Gunrock [5], this work is a little slower when using the same graphs, which may be the effect of more optimizations in Gunrock's traversal kernels. As we add more GPUs in this work, we see the gap in performance is narrowing, which indicates better scalability; and the memory size improvements we made in this paper allows us to process larger graphs on one node, up to scale 28 on 4 GPUs, than any other GPU-based previous work.…”

Section: Overall Results and Comparisons (mentioning)

confidence: 99%

“…Using GPUs in the same node for BFS yields impressive per-node performance [5], [9], [11], [12], but because all their communication is within a node and thus faster than within a cluster, their per-node performance is superior to cluster-based solutions. However, their graphs must fit into one node's memory (GPU or CPU), and this inherently limits the maximum size of a processed graph.…”

Section: BFS Within Single Node (mentioning)

confidence: 99%

Scalable Breadth-First Search on a GPU Cluster

Pan, Pearce, Owens

2018

2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Self Cite


On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices, and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.


“…In this paper, we address this problem by exploiting graphics processing units (GPUs). Because of massive hardware parallelism and high memory bandwidth, GPUs have been widely used in diverse applications including machine learning [5][6][7], graph processing [8][9][10], big data analytics [11,12], image processing [13], and fluid dynamics [14]. In order to reap the power of GPUs, the algorithmic steps need to be mapped delicately onto the architecture of GPUs, especially the thread and memory hierarchy.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Tensor Sensing for RF Tomographic Imaging on GPUs

Zhang

2019

Future Internet


Radio-frequency (RF) tomographic imaging is a promising technique for inferring multi-dimensional physical space by processing RF signals traversed across a region of interest. Tensor-based approaches for tomographic imaging are superior at detecting the objects within higher dimensional spaces. The recently-proposed tensor sensing approach based on the transform tensor model achieves a lower error rate and faster speed than the previous tensor-based compress sensing approach. However, the running time of the tensor sensing approach increases exponentially with the dimension of tensors, thus not being very practical for big tensors. In this paper, we address this problem by exploiting massively-parallel GPUs. We design, implement, and optimize the tensor sensing approach on an NVIDIA Tesla GPU and evaluate the performance in terms of the running time and recovery error rate. Experimental results show that our GPU tensor sensing is as accurate as the CPU counterpart with an average of 44.79× and up to 84.70× speedups for varying-sized synthetic tensor data. For IKEA Model 3D model data of a smaller size, our GPU algorithm achieved 15.374× speedup over the CPU tensor sensing. We further encapsulate the GPU algorithm into an open-source library, called cuTensorSensing (CUDA Tensor Sensing), which can be used for efficient RF tomographic imaging.

