Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit 2016
DOI: 10.1145/2884045.2884046
Performance portable GPU code generation for matrix multiplication

Abstract: Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to attempt to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device. Achieving performance portability is the…

Cited by 20 publications (13 citation statements)
References 21 publications
“…Our exploration process is divided into two phases: (1) Rewriting, and (2) Auto-tuning. For this evaluation, our existing rewriting strategy [42] was used without making any adjustments. In the first phase, a derivation tree was created by applying multiple potentially applicable rules, each creating a separate branch.…”
Section: Methods
confidence: 99%
“…This design makes it easy to extend and add new optimizations into the compiler, whereas in Delite optimizations are hard-coded for each backend. A more detailed discussion about this process can be found in our previous work [42].…”
Section: The Real Challenge: Universal High Performance Code Generation
confidence: 99%
“…Introducing Partition via Rewriting Rules One of the core ideas underpinning Lift is the use of an automated exploration system that uses rewriting rules to automatically generate high performance code. A rewrite rule is a semantics-preserving transformation of expressions, and is Lift's way of expressing optimization choices that are automatically explored in the optimization process using stochastic methods, as explained by [15].…”
Section: Partition
confidence: 99%
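The statement above describes rewrite rules as semantics-preserving transformations of expressions. A minimal sketch of that idea, in the spirit of a map-fusion rule (map(f) after map(g) rewrites to map(f after g)); the names `Map`, `Comp`, and `fuse_maps` are hypothetical illustrations, not Lift's actual API:

```python
# Hypothetical mini expression language with one rewrite rule (map fusion).
from dataclasses import dataclass

@dataclass(frozen=True)
class Map:          # apply a function element-wise over a collection
    f: object

@dataclass(frozen=True)
class Comp:         # composition: apply `right` first, then `left`
    left: object
    right: object

def fuse_maps(expr):
    """Rewrite Comp(Map(f), Map(g)) into Map(Comp(f, g)); otherwise unchanged."""
    if isinstance(expr, Comp) and isinstance(expr.left, Map) and isinstance(expr.right, Map):
        return Map(Comp(expr.left.f, expr.right.f))
    return expr  # rule does not apply at this node

def apply_fn(f, x):
    """Apply a (possibly composed) scalar function."""
    if isinstance(f, Comp):
        return apply_fn(f.left, apply_fn(f.right, x))
    return f(x)

def evaluate(expr, xs):
    """Interpret an expression over a list, to check semantics preservation."""
    if isinstance(expr, Map):
        return [apply_fn(expr.f, x) for x in xs]
    if isinstance(expr, Comp):
        return evaluate(expr.left, evaluate(expr.right, xs))
    raise TypeError(expr)

inc = lambda x: x + 1
dbl = lambda x: x * 2
e = Comp(Map(inc), Map(dbl))  # map(inc) after map(dbl)

# The rewrite changes the program's structure but not its meaning:
assert evaluate(e, [1, 2, 3]) == evaluate(fuse_maps(e), [1, 2, 3])  # both [3, 5, 7]
```

Because both sides of a rule compute the same result, an exploration system can apply rules freely, branching on each applicable rule, and every leaf of the resulting derivation tree is still a correct program.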
“…To explore different algorithmic optimization choices, we encode the optimizations discussed in section 5.3 plus 1D and 2D register blocking, and the tiling presented by others [22]. Starting from the high-level expression in Listing 1, we apply these rewrite rules at all valid locations in an arbitrary order.…”
Section: Automatic Exploration
confidence: 99%
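The statement above mentions tiling and 2D register blocking for matrix multiplication. A minimal sketch of the loop structure these optimizations produce, written as plain Python loops so the nesting is visible; the tile and block sizes (`TILE`, `RB`) are illustrative choices, not values from the cited work, and `n` is assumed divisible by `TILE`, with `TILE` divisible by `RB`:

```python
TILE = 4   # tile edge length, modeling the cache / local-memory level
RB = 2     # 2D register block: each "thread" computes an RB x RB patch of C

def matmul_tiled(A, B, n):
    """Compute C = A @ B for n x n lists-of-lists, with tiling + register blocking."""
    C = [[0.0] * n for _ in range(n)]
    for it in range(0, n, TILE):                      # tile loops over C
        for jt in range(0, n, TILE):
            for kt in range(0, n, TILE):              # tile loop over the reduction
                for ib in range(it, it + TILE, RB):   # register-block loops
                    for jb in range(jt, jt + TILE, RB):
                        # acc models an RB x RB block of C held in registers
                        acc = [[C[ib + i][jb + j] for j in range(RB)]
                               for i in range(RB)]
                        for k in range(kt, kt + TILE):
                            for i in range(RB):
                                a = A[ib + i][k]      # reused across the j loop
                                for j in range(RB):
                                    acc[i][j] += a * B[k][jb + j]
                        # write the accumulated block back
                        for i in range(RB):
                            for j in range(RB):
                                C[ib + i][jb + j] = acc[i][j]
    return C
```

Tiling improves data reuse at the cache or local-memory level, while the register block lets each element of `A` loaded into `a` be reused `RB` times before the next load, reducing memory traffic per multiply-add.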
“…These rules encode algorithmic transformations as well as hardware-specific low-level optimizations. Recent work [22] has shown that this generic compiler approach leads to high performance for desktop-class GPUs from AMD and Nvidia.…”
Section: Introduction
confidence: 99%