Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing 2015
DOI: 10.1145/2808091.2808092

Meta-programming and auto-tuning in the search for high performance GPU code

Abstract: Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost of ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphi…
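The abstract describes auto-tuning over compile-time parameters exposed by a kernel. A minimal sketch of that idea, with a purely hypothetical cost model standing in for actually compiling and timing GPU code (the names `simulated_kernel_time` and `autotune` are illustrative, not the paper's API):

```python
import math

# Hypothetical stand-in for compiling and timing a GPU kernel at a given
# compile-time configuration; a real framework would generate and launch
# device code here. The cost model below is purely illustrative.
def simulated_kernel_time(threads_per_block, unroll):
    occupancy = min(threads_per_block / 1024, 1.0)
    overhead = 0.002 * unroll + 0.5 / threads_per_block
    return (1.0 / occupancy) * 0.001 + overhead

def autotune(param_space, time_fn):
    """Exhaustively evaluate every configuration and keep the fastest."""
    best_cfg, best_t = None, math.inf
    for cfg in param_space:
        t = time_fn(*cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# Compile-time parameters exposed for tuning: threads per block
# (warp-size multiples) and an unroll factor.
space = [(tpb, u) for tpb in (32, 64, 128, 256, 512, 1024) for u in (1, 2, 4)]
cfg, t = autotune(space, simulated_kernel_time)
```

Real frameworks replace the exhaustive loop with smarter search when the parameter space is large, but the exhaustive form makes the structure of the problem clear.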

Cited by 8 publications (6 citation statements) · References 20 publications
“…Therefore, it is important to propose and evaluate different strategies for helping to choose a better number of threads per block configuration. The available literature does not agree on the optimal strategy for choosing the number of threads per block 20,21 . The most common strategies involve using the maximum number of threads per block supported by the GPU 22 or a fixed number of threads per block chosen by the programmer 21,23 .…”
Section: Methods (mentioning)
confidence: 99%
“…The available literature does not agree on the optimal strategy for choosing the number of threads per block. 20,21 The most common strategies involve using the maximum number of threads per block supported by the GPU 22 or a fixed number of threads per block chosen by the programmer. 21,23 While using fixed numbers like the warp size or the maximum number of threads supported by the GPU are useful to identify bottlenecks in a GPU kernel, these numbers may not be the best possible configuration for the thread block size.…”
Section: Strategies For Choosing the Number Of Threads Per Block (mentioning)
confidence: 99%
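The citation above contrasts fixed choices (the warp size, the device maximum) with searching the configuration space. A short sketch of the search alternative, where `cost` is a hypothetical stand-in for launching and timing the kernel at each candidate block size:

```python
def candidate_block_sizes(warp_size=32, max_threads=1024):
    """Warp-size multiples up to the device maximum: the usual search
    space when no single fixed choice is known to be optimal."""
    return [warp_size * k for k in range(1, max_threads // warp_size + 1)]

def pick_block_size(time_fn, warp_size=32, max_threads=1024):
    # Benchmark each candidate and keep the fastest; time_fn stands in
    # for an actual kernel launch plus timing.
    return min(candidate_block_sizes(warp_size, max_threads), key=time_fn)

# Illustrative cost model that happens to favor mid-sized blocks
# (hypothetical; real timings depend on the kernel and the device).
cost = lambda tpb: abs(tpb - 256) + 1
best = pick_block_size(cost)
```

Fixed choices like `max_threads` or the warp size fall out as degenerate one-element search spaces, which is why they are useful baselines but not necessarily optimal configurations.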
“…The approaches in both [12] and [24] generate OpenCL GPU code from data-parallel software inputs, profiling to map computations onto CPU or GPU targets. In [22], Haskell meta-programs tune GPU kernel launch parameters for designs expressed in an embedded DSL. In [23] automatic source-to-source transformations optimise CUDA stencil computations.…”
Section: Related Work (mentioning)
confidence: 99%
“…One of the most prominent examples can be found in Haskell, where the Accelerate, Obsidian and Nikola libraries (just to name a few) provide GPU utilization primitives. These usually encode array operations in an EDSL way giving variable number of primitives and usually code generation to low level constructs or GPU intermediate language (IL) [2][3][4][5]. Dedicated FP languages were proposed like NOVA, from NVIDIA [6].…”
Section: Related Work (mentioning)
confidence: 99%