AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Matsumura, Kazuaki; Zohouri, Hamid Reza; Wahib, Mohamed; Endo, Toshio; Matsuoka, Satoshi

doi:10.1145/3368826.3377904

Cited by 45 publications

(35 citation statements)

References 37 publications

“…Rawat et al [51] introduced a domain-specific language called "STENCILGEN" to describe and generate optimized GPU code for stencil computations, leveraging multiple tiling techniques. AN5D [42] is another framework for automatic generation of optimized stencil GPU code, from generic C code. It also depends on different forms of temporal blocking.…”

Section: Related Work (mentioning)

confidence: 99%
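The citation statement above mentions temporal blocking without showing what it involves. As a minimal sketch only, and not AN5D's generated code, the following Python shows one common form (overlapped tiling) on a 1D three-point stencil: each tile loads a halo as wide as the number of fused time steps, advances locally, and keeps only its interior. The function names, the averaging stencil, and the fixed-boundary handling are all assumptions made for illustration.

```python
def step(u):
    """One Jacobi step of a 3-point average; the two end cells stay fixed."""
    if len(u) < 3:
        return list(u)
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def naive(u, steps):
    """Reference: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporally_blocked(u, steps, tile):
    """Overlapped temporal blocking: fuse `steps` time steps per tile.

    Each tile reads a halo of `steps` cells on both sides (clipped at the
    array ends). Stale values creep inward from the halo edge by one cell
    per step, so after `steps` steps only the halo is invalid and the
    tile's own cells are exact.
    """
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)
        hi = min(end + steps, n)
        local = u[lo:hi]
        for _ in range(steps):
            local = step(local)
        # Discard the halo and keep only the cells this tile owns.
        out[start:end] = local[start - lo:end - lo]
    return out
```

The halo cells are recomputed by neighbouring tiles, so the blocked version trades redundant arithmetic for fewer passes over the full array; that trade-off is what makes the choice of blocking degree a tuning problem.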

Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation

Abdelaal

Kong

2021

Proceedings of the ACM International Conference on Supercomputing


Loop tiling is a key high-level transformation which is known to maximize locality in loop intensive programs. It has been successfully applied to a number of applications including tensor contractions, iterative stencils and machine learning. This technique has also been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation largely depends on a set of inputs given, the tile sizes, due to the complex trade-off between locality and parallelism. This problem is exacerbated in GPGPU architectures due to limited hardware resources such as the available shared-memory. In this paper we present a new technique to compute resource conscious tile sizes for affine programs. We use Integer Linear Programming (ILP) constraints and objectives in a cross-compiler fashion to faithfully and effectively mimic the transformations applied in a polyhedral GPU compiler (PPCG). Our approach significantly reduces the need for experimental auto-tuning by generating only two tile size configurations that achieve strong out-of-the-box performance. We evaluate the effectiveness of our technique using the Polybench benchmark suite on two GPGPUs, an AMD Radeon VII and an NVIDIA Tesla V100, using OpenCL and CUDA programming models. Experimental validation reveals that our approach achieves nearly 75% of the best empirically found tile configuration across both architectures. CCS Concepts: Software and its engineering → Compilers; General and reference → Performance; Mathematics of computing → Combinatorial optimization; Computer systems organization → Parallel architectures.

“…For example, the authors of [4] use a model to optimize the computation/register ratio, which is important for the class of stencils they are targeting. In [5], a standard roofline model with a fixed, theoretical memory volume is used for a full exploration of the configuration space, followed by benchmarking the top five candidates.…”

Section: Related Work (mentioning)

confidence: 99%
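The selection strategy this citation statement attributes to [5] (rank every configuration with a roofline model, then benchmark only the top five) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the framework's actual model: the `Candidate` fields, the hardware peaks, and the toy cost model in `example_candidates` (memory traffic shrinking with the blocking degree while redundant work grows by 10% per extra step) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    flops: float        # total floating-point operations (assumed known)
    bytes_moved: float  # fixed, theoretical memory volume in bytes


def roofline_time(c, peak_flops, peak_bandwidth):
    """Roofline estimate: the slower of the compute and memory bounds."""
    return max(c.flops / peak_flops, c.bytes_moved / peak_bandwidth)


def shortlist(candidates, peak_flops, peak_bandwidth, k=5):
    """Rank all candidates by predicted time and keep the k fastest."""
    ranked = sorted(
        candidates,
        key=lambda c: roofline_time(c, peak_flops, peak_bandwidth),
    )
    return ranked[:k]


def example_candidates(base_flops=1e12, base_bytes=8e11, max_degree=8):
    """Hypothetical configuration space indexed by temporal blocking degree."""
    return [
        Candidate(
            name=f"bt={bt}",
            flops=base_flops * (1 + 0.1 * (bt - 1)),  # made-up redundancy cost
            bytes_moved=base_bytes / bt,              # made-up traffic saving
        )
        for bt in range(1, max_degree + 1)
    ]
```

Calling `shortlist(example_candidates(), peak_flops=7e12, peak_bandwidth=9e11)` returns the five configurations that would then be handed to real benchmarking; the model only prunes the search space and does not pick the final winner.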

Opening the Black Box: Performance Estimation during Code Generation for GPUs

Ernst

Hager

Holzer

et al. 2021

Preprint


Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the "pystencils" stencil code generator, which is used to generate kernels for a range-four 3D25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.


“…There is a rich literature describing efforts to efficiently implement stencil computations on CPUs [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [15], [16], [12] and GPUs [13], [14], [17], [18], [19], [22], [23]. We discuss the most related efforts below.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, as part of their AN5D framework work, Matsumura et al [19] apply three more refinements to 2.5D and 3.5D solutions: fixed register allocations, double buffering, and division of the streaming dimension. While these approaches work extremely well for simple single-statement kernels, neither boundary conditions nor multi-statement stencils are evaluated.…”

Section: Related Work (mentioning)

confidence: 99%
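Of the refinements listed in the statement above, double buffering is the easiest to illustrate in isolation. The sketch below shows only the generic ping-pong pattern on a 1D three-point stencil: two buffers swap roles after every time step, so no step overwrites values it still needs to read and no buffer is reallocated. It is an assumption-laden illustration and does not reproduce how AN5D applies double buffering on the GPU; the function name and the stencil are hypothetical.

```python
def jacobi_double_buffered(u, steps):
    """Run `steps` Jacobi sweeps using two alternating buffers."""
    src = list(u)
    dst = list(u)  # end cells are copied once and never rewritten
    for _ in range(steps):
        for i in range(1, len(src) - 1):
            # Read only from src, write only to dst.
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
        # Swap roles: this step's output is the next step's input.
        src, dst = dst, src
    return src
```

Because reads and writes never touch the same buffer within a step, the inner loop carries no dependency between iterations, which is the property a parallel implementation relies on.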

Accelerating High-Order Stencils on GPUs

Sai

Mellor-Crummey

Meng

et al. 2020

2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)


Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of the techniques work well for high-order stencils, such as those used for seismic imaging. Furthermore, coping with boundary conditions often requires different computational logic, which complicates efficient exploitation of the thread-level parallelism on GPUs. In this paper, we study practical seismic imaging computations on GPUs using high-order stencils on large domains with meaningful boundary conditions. We manually crafted a collection of implementations of a 25-point seismic modeling stencil in CUDA along with code to apply the boundary conditions. We evaluated our stencil code shapes, memory hierarchy usage, data-fetching patterns, and other performance attributes. We conducted an empirical evaluation of these stencils using several mature and emerging tools and discuss our quantitative findings. Among our implementations, we achieve twice the performance of a proprietary code developed in C and mapped to GPUs using OpenACC. Additionally, several of our implementations have excellent performance portability.
