Compiler-Directed Transformation for Higher-Order Stencils

Basu, Protonu; Hall, Mary; Williams, Samuel; Straalen, Brian Van; Oliker, Leonid; Colella, Phillip

doi:10.1109/ipdps.2015.103

Cited by 34 publications

(20 citation statements)

References 37 publications


“…To ensure a tight coupling, several prior efforts on guiding register allocation or instruction scheduling were implemented as a compiler pass in research/prototype compilers [7,16,20,41,45], or open-source production compilers [29,46]. However, like some other recent efforts [6,28,50], we implement our reordering optimization at source level for the following reasons: (1) it allows external optimizations for closed-source compilers like NVCC; (2) it allows us to perform transformations like exposing FMAs using operator distributivity, and performing kernel fusion/fission, which can be performed more effectively and efficiently at source level; and (3) it is input-dependent, not machine- or compiler-dependent: with an implementation coupled to compiler passes, it would have to be re-implemented across compilers with different intermediate representations. Our framework massages the input to a form that is more amenable to further optimizations by any GPU compiler, and we use appropriate compilation flags whenever possible to ensure that our reordering optimization is not undone by the compiler passes.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%


Register optimizations for stencils on GPUs

et al. 2018


The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

[Code listing residue, partially lost in extraction: a weighted stencil update of the form out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2], with jj running from -2 to 2; the enclosing loop headers are garbled.]
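The stencil fragment attached to this abstract survives only in part, so the following is a minimal sketch of what such a kernel might look like, not a reconstruction of the paper's listing. The grid size `N`, the radius macro `R`, the function name `stencil5x5`, the outer loop bounds, and the local accumulator `acc` are all assumptions made for illustration; only the update expression comes from the fragment.

```c
#define N 8   /* hypothetical grid size, not taken from the source */
#define R 2   /* stencil radius, giving a 5x5 footprint as w[ii+2][jj+2] suggests */

/* Weighted 5x5 stencil over the interior of an N x N grid.
 * The 25 products for one output point are accumulated in a local
 * scalar, which a compiler can keep in a register, and then added
 * to out[i][j] once. Boundary points are left untouched. */
void stencil5x5(double in[N][N], double w[2 * R + 1][2 * R + 1],
                double out[N][N])
{
    for (int i = R; i < N - R; i++) {
        for (int j = R; j < N - R; j++) {
            double acc = 0.0;
            for (int ii = -R; ii <= R; ii++)
                for (int jj = -R; jj <= R; jj++)
                    acc += in[i + ii][j + jj] * w[ii + R][jj + R];
            out[i][j] += acc;
        }
    }
}
```

Each output point reads 25 inputs, and neighbouring points share most of them. Unrolling the `j` loop exposes that reuse, but it also increases the number of values that are live at once, which is the register-pressure problem the abstract describes.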


“…Basu et al. [6] propose a partial sum optimization implemented within the CHiLL compiler [23]. The partial sums are computed over planes for 3D stencils, and redundant computation is eliminated by performing array common subexpression elimination (CSE) [15].…”

Section: Related Work (mentioning)

confidence: 99%

Register optimizations for stencils on GPUs

et al. 2018
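The partial-sum idea in the statement above can be illustrated with a small hand-written sketch. This is not the CHiLL implementation: it is a 2D analogue (column sums instead of plane sums) of a uniform-weight 3x3 box stencil, and the names `M`, `box_naive`, `box_partial`, and `col` are hypothetical.

```c
#define M 6   /* hypothetical grid size */

/* Naive 3x3 box sum: every output point re-reads and re-adds
 * all nine inputs in its neighbourhood. */
void box_naive(double in[M][M], double out[M][M])
{
    for (int i = 1; i < M - 1; i++)
        for (int j = 1; j < M - 1; j++) {
            double s = 0.0;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    s += in[i + di][j + dj];
            out[i][j] = s;
        }
}

/* Partial-sum variant: for each row i, the three-element column sum
 * col[j] is computed once and then reused by the outputs at
 * j-1, j, and j+1, removing the redundant additions. */
void box_partial(double in[M][M], double out[M][M])
{
    double col[M];
    for (int i = 1; i < M - 1; i++) {
        for (int j = 0; j < M; j++)
            col[j] = in[i - 1][j] + in[i][j] + in[i + 1][j];
        for (int j = 1; j < M - 1; j++)
            out[i][j] = col[j - 1] + col[j] + col[j + 1];
    }
}
```

The two functions add the same values in a different order, so they agree exactly on integer-valued data but may differ in the last bits for general floating-point inputs.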


“…In fact, many scientific computations are compiled using -ffast-math flag [4], which allows a compiler to exploit associativity of floating-point operations to improve performance at the expense of IEEE compliance. Many recent efforts have leveraged operator associativity to drive code optimization strategies [1], [5], [6].…”

Section: Background and Motivation (mentioning)

confidence: 99%

Associative Instruction Reordering to Alleviate Register Pressure

Rawat, Sukumaran-Rajam, Rountev et al. 2018

SC18: International Conference for High Performance Computing, Networking, Storage and Analysis


Register allocation is generally considered a practically solved problem. For most applications, the register allocation strategies in production compilers are very effective in controlling the number of loads/stores and register spills. However, existing register allocation strategies are not effective and result in excessive register spilling for computation patterns with a high degree of many-to-many data reuse, e.g., high-order stencils and tensor contractions. We develop a source-to-source instruction reordering strategy that exploits the flexibility of reordering associative operations to alleviate register pressure. The developed transformation module implements an adaptable strategy that can appropriately control the degree of instruction-level parallelism, while relieving register pressure. The effectiveness of the approach is demonstrated through experimental results using multiple production compilers (GCC, Clang/LLVM) and target platforms (Intel Xeon Phi, and Intel x86 multi-core).
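Reordering "associative" floating-point operations is not value-preserving under IEEE 754, which is why such reordering has to be permitted explicitly (for example with the -ffast-math flag mentioned in the citation statement above). The sketch below shows the effect on a three-term sum; the function names and the chosen inputs are illustrative assumptions, not taken from the cited paper.

```c
/* Left-to-right grouping: (a + b) + c */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Reassociated grouping: a + (b + c) */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, c = 1.0 and default IEEE double arithmetic, `sum_left` cancels the large terms first and returns 1.0, whereas in `sum_right` the 1.0 is absorbed into -1e20 by rounding and the result is 0.0. A reordering tool that exploits associativity accepts this kind of difference in exchange for a schedule with lower register pressure.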


“…After applying optimizations specified in the script, CHiLL generates optimized C (or Fortran) code. Recently CHiLL has been extended to generate OpenMP code [2].…”

Section: Chill Background (mentioning)

confidence: 99%

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

et al. 2017

Self Cite


GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. As such, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We show that with autotuning we can attain near-Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.
