Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required the use of a GPU-specific programming model such as CUDA, OpenCL, or OpenACC. As a result, to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We show that with autotuning we can attain near-Roofline performance (Roofline being a performance bound for a given computation and target architecture) across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures, as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.
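As a rough illustration of the Roofline bound mentioned in this abstract, the following minimal sketch computes attainable performance as the smaller of peak compute throughput and arithmetic intensity times memory bandwidth. The function name, the hardware figures, and the per-point flop and byte counts are assumptions chosen for illustration; they are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_point, bytes_per_point):
    """Roofline bound: min(peak compute, arithmetic intensity * memory bandwidth).

    All arguments are hypothetical inputs supplied by the caller.
    """
    arithmetic_intensity = flops_per_point / bytes_per_point  # flops per byte
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


if __name__ == "__main__":
    # Assumed machine: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
    # Assumed kernel: 8 flops per grid point and 16 bytes of memory traffic
    # per point (one 8-byte read and one 8-byte write under ideal caching).
    bound = roofline_gflops(1000.0, 200.0, flops_per_point=8, bytes_per_point=16)
    print(f"Roofline bound: {bound} GFLOP/s")
```

With these assumed numbers the kernel is bandwidth-bound: the arithmetic intensity of 0.5 flops per byte caps performance well below the compute peak, which is the usual situation for low-order stencils.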

“…Stencil-specific code generators have been used to generate and autotune stencil code on GPUs [10,37]. These techniques target shared memory.…”

Section: Compiler Optimizations DSLs and Programming Models For Sten (mentioning)

confidence: 99%

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

et al. 2017

“…A description of the stencil is then stored as a Stencil object, which can be used by other modules for transformation purposes. Stencil description is typically adopted by domain-specific languages (DSLs) [23,47] to deal with this problem. However, while DSLs typically require the user to explicitly define the stencil, the Panda compiler is capable of detecting it automatically, like several existing tools [3,12,42,8].…”

Section: Overview (mentioning)

confidence: 99%

Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

Sourouri

Baden

Cai

2016

Int J Parallel Prog

This paper describes a new compiler framework for heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil codes originally written in C can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate state-of-the-art hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. At the same time, the auto-generated hybrid codes hide the overhead of various data motion by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes from our compiler can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. We thus believe that the user-friendliness and performance delivered by our domain-specific compiler framework allow com-

“…For small search spaces, an exhaustive search was used to determine the best run-time parameters [23], whereas for a larger search space, methods like dynamic programming or stochastic search can be used [17].…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing and Auto-Tuning Iterative Stencil Loops for GPUs with the In-Plane Method

Tang

Tan

Krishnamoorthy

et al. 2013

2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Abstract: Stencils represent an important class of computations that are used in many scientific disciplines. Increasingly, many of the stencil computations in scientific applications are being offloaded to GPUs to improve running times. Since a large part of the simulation time is spent inside the stencil kernels, optimizing the kernel is therefore important in the context of achieving greater computation efficiencies and reducing simulation time. In this work, we proposed a novel in-plane method for stencil computations on GPUs and compared its performance with the conventional method implemented in the Nvidia SDK. We also implemented an auto-tuning framework for our method to select the optimal parameters for different GPU architectures. A performance model was developed for our proposed method, and is used to speed up the auto-tuning process. Our results show that a speedup of nearly 2× can be achieved compared to Nvidia's implementation.
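The idea of auto-tuning a stencil kernel over a small parameter space can be sketched as follows. This is a CPU-only illustration in plain Python, not the in-plane method or the model-guided tuner described in the abstract: it swaps in a plain exhaustive search, and the tile sizes stand in for GPU thread-block dimensions. All names (`stencil_7pt_blocked`, `time_config`, `exhaustive_search`) and the candidate values are hypothetical.

```python
import itertools
import time


def stencil_7pt_blocked(src, dst, n, bx, by):
    """One sweep of a 7-point stencil on an n*n*n grid stored as a flat list.

    The (j, i) plane is traversed in tiles of by * bx points; boundary points
    are left untouched.
    """
    plane = n * n
    for jj in range(1, n - 1, by):
        for ii in range(1, n - 1, bx):
            for k in range(1, n - 1):
                for j in range(jj, min(jj + by, n - 1)):
                    for i in range(ii, min(ii + bx, n - 1)):
                        c = (k * n + j) * n + i
                        dst[c] = (src[c - 1] + src[c + 1]
                                  + src[c - n] + src[c + n]
                                  + src[c - plane] + src[c + plane]
                                  - 6.0 * src[c])


def time_config(config, n=16, repeats=3):
    """Return the best wall-clock time of the sweep for one (bx, by) tile size."""
    bx, by = config
    src = [float(x % 7) for x in range(n ** 3)]
    dst = [0.0] * (n ** 3)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        stencil_7pt_blocked(src, dst, n, bx, by)
        best = min(best, time.perf_counter() - start)
    return best


def exhaustive_search(search_space, cost):
    """Evaluate every configuration and return (best_config, best_cost)."""
    best_config, best_cost = None, float("inf")
    for config in search_space:
        value = cost(config)
        if value < best_cost:
            best_config, best_cost = config, value
    return best_config, best_cost


if __name__ == "__main__":
    # Hypothetical candidate tile sizes; a real tuner would derive them
    # from the target architecture.
    space = list(itertools.product([4, 8, 16], [2, 4, 8]))
    print(exhaustive_search(space, time_config))
```

Exhaustive search is only practical while the space stays small. A performance model, as the abstract describes, would replace `time_config` with a cheap prediction or prune the space before any timing runs.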

Cited by 106 publications

References 20 publications