2014 IEEE International Symposium on Workload Characterization (IISWC)
DOI: 10.1109/iiswc.2014.6983053

Graph processing on GPUs: Where are the bottlenecks?

Cited by 72 publications (41 citation statements)
References 26 publications
“…Xu et al [16] studied 12 graph applications in order to identify bottlenecks that limit GPU performance. They show that graph applications tend to need frequent kernel invocations and make ineffective use of caches compared to non-graph applications.…”
Section: Related Work
confidence: 99%
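The "frequent kernel invocations" bottleneck quoted above comes from level-synchronous traversal: each frontier expansion is typically one GPU kernel launch, so the launch count grows with the graph's diameter. A minimal Python sketch (an illustration, not code from the cited paper) that counts those would-be launches:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: each while-loop iteration corresponds to
    one GPU kernel launch in a typical implementation."""
    dist = {source: 0}
    frontier = [source]
    launches = 0
    while frontier:
        launches += 1  # one kernel launch per frontier expansion
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return dist, launches

# A path graph maximizes diameter: a 10-vertex path needs 10 launches,
# one per BFS level, each doing very little work.
path = {i: [i + 1] for i in range(9)}
path[9] = []
dist, launches = bfs_levels(path, 0)
```

High-diameter, low-degree graphs thus pay launch and CPU-synchronization overhead on every level, which is one reason the cited study finds kernel invocation frequency to be a dominant cost.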
“…Workload distribution and load balancing are crucial issues for performance; previous work has observed that these operations are dependent on graph structure [17,16]. Hardwired graph primitive implementations have prioritized efficient (and primitive-customized) implementations of these operations, thus to be competitive, high-level programmable frameworks must offer high-performance but high-level strategies to address them.…”
Section: Critical Aspects For Efficiency
confidence: 99%
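The load-balancing problem described above can be made concrete: under a naive vertex-per-thread mapping, a warp runs at the speed of its highest-degree vertex, so skewed degree distributions waste most lanes. A hedged Python sketch (the imbalance metric is an illustrative assumption, not from the cited works) that models this:

```python
def thread_imbalance(degrees, warp_size=32):
    """Model naive vertex-per-thread mapping: a warp finishes only when
    its slowest lane does, so warp cost = max degree in the warp times
    the lane count. Returns occupied-lane time / useful work (1.0 = balanced)."""
    work = 0
    occupied = 0
    for i in range(0, len(degrees), warp_size):
        warp = degrees[i:i + warp_size]
        work += sum(warp)
        occupied += max(warp) * len(warp)
    return occupied / work

# Uniform degrees are perfectly balanced; a power-law-like warp with one
# hub vertex leaves 31 lanes nearly idle.
uniform = [8] * 32
skewed = [1] * 31 + [1000]
```

This is why hardwired primitives use degree-aware scheduling (per-thread, per-warp, per-block work assignment), and why programmable frameworks must offer equivalent high-level strategies to stay competitive.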
“…We also note that the algorithm of Auer and Bisseling repeatedly considers all the vertices and is therefore not (work) efficient, but scales better. Xu et al use the algorithm of Auer and Bisseling, along with several other graph algorithms [21]. We address some of the performance issues raised by them in our work.…”
Section: Related Work
confidence: 99%
“…The Nvidia Kepler K40 presented in Section 3 is currently one of the best manycore platforms for scientific computing. While many significant performance gains for compute intensive applications with regular and predictable memory access patterns have been demonstrated using GPUs, the efficient implementation of irregular applications such as graph algorithms remains a challenge [21]. Highly irregular degree distributions, poor locality in memory accesses, and minimal computation on accessed data make efficient utilization of compute resources challenging.…”
Section: GPU-Suitor-Hybrid
confidence: 99%
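The "poor locality in memory accesses" noted above arises in the standard CSR (compressed sparse row) layout: neighbor IDs drive data-dependent, scattered reads into vertex data, which defeats coalescing and caching. A small Python sketch of a CSR neighbor gather (an illustrative example, not taken from the cited paper):

```python
def csr_gather(row_ptr, col_idx, values):
    """Sum each vertex's neighbor values from a CSR graph.
    col_idx entries are arbitrary vertex IDs, so the reads into
    `values` are scattered and data-dependent (cache-unfriendly)."""
    out = []
    for u in range(len(row_ptr) - 1):
        neighbors = col_idx[row_ptr[u]:row_ptr[u + 1]]
        out.append(sum(values[v] for v in neighbors))
    return out

# 3-vertex graph: vertex 0 -> {1, 2}, vertex 1 -> {0}, vertex 2 -> {0, 1}.
out = csr_gather([0, 2, 3, 5], [1, 2, 0, 0, 1], [10, 20, 30])
```

The `col_idx` slice itself streams contiguously, but the indirect reads `values[v]` jump across memory in input-dependent order, which is the "minimal computation on accessed data" pattern the quoted passage identifies.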
“…A continuation of that research uses a software simulator to change GPU architectural parameters and observes performance is more sensitive to L2 cache parameters than to DRAM parameters, which suggests there is exploitable locality [37]. Xu et al also use a simulator and identify synchronization with the CPU (kernel invocations and data transfers) as well as GPU memory latency to be the biggest performance bottlenecks [44]. Che et al profile the Pannotia suite of graph algorithms and observe substantial diversity across algorithms and inputs [10].…”
Section: Related Work
confidence: 99%