Generating GPU Code from a High-Level Representation for Image Processing Kernels

Membarth, Richard; Lokhmotov, Anton; Teich, Jürgen

doi:10.1007/978-3-642-29737-3_31

Cited by 7 publications

(4 citation statements)

References 3 publications

“…It has been used successfully in several works (e.g., Membarth et al. [2011]). The specification contains a description of the iteration space I, a precedence relationship R to set order of execution, a partition P to indicate sets of iterations preferably executed on a single processing element, and a set of memory locations that may be read (M_r) or written (M_w) for a given iteration.…”

Section: Mathematical Code Representations (mentioning)

confidence: 99%

Algorithmic species

Nugteren

Custers

Corporaal

2013

ACM Trans. Archit. Code Optim.

Code generation and programming have become ever more challenging over the last decade due to the shift towards parallel processing. Emerging processor architectures such as multi-cores and GPUs exploit increasingly parallelism, requiring programmers and compilers to deal with aspects such as threading, concurrency, synchronization, and complex memory partitioning. We advocate that programmers and compilers can greatly benefit from a structured classification of program code. Such a classification can help programmers to find opportunities for parallelization, reason about their code, and interact with other programmers. Similarly, parallelising compilers and source-to-source compilers can take threading and optimization decisions based on the same classification. In this work, we introduce algorithmic species, a classification of affine loop nests based on the polyhedral model and targeted for both automatic and manual use. Individual classes capture information such as the structure of parallelism and the data reuse. To make the classification applicable for manual use, a basic vocabulary forms the base for the creation of a set of intuitive classes. To demonstrate the use of algorithmic species, we identify 115 classes in a benchmark set. Additionally, we demonstrate the suitability of algorithmic species for automated uses by showing a tool to automatically extract species from program code, a species-based source-to-source compiler, and a species-based performance prediction model.
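The specification quoted in the citation statement above (iteration space I, precedence R, partition P, and read/write sets M_r and M_w) can be pictured as a small data structure. The following is only an illustrative sketch, not the representation used by any of the cited papers: the class name `KernelSpec`, the helper `has_independent_writes`, the 1-D three-point stencil, and the image width of 8 are all assumptions made for the example. Read locations index an input buffer and write locations index a separate output buffer.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set, Tuple

Iteration = int  # hypothetical: a 1-D iteration index


@dataclass
class KernelSpec:
    """Hypothetical container for the five parts named in the quote."""
    iterations: List[Iteration]                    # iteration space I
    precedence: Set[Tuple[Iteration, Iteration]]   # R: (a, b) means a must run before b
    partition: List[FrozenSet[Iteration]]          # P: groups meant for one processing element
    reads: Callable[[Iteration], FrozenSet[int]]   # M_r(i): input locations read by iteration i
    writes: Callable[[Iteration], FrozenSet[int]]  # M_w(i): output locations written by iteration i

    def has_independent_writes(self) -> bool:
        """True if R is empty and no two iterations write the same output location."""
        if self.precedence:
            return False
        seen: Set[int] = set()
        for i in self.iterations:
            w = self.writes(i)
            if seen & w:
                return False
            seen |= w
        return True


WIDTH = 8  # assumed image width for the example


def stencil_reads(i: Iteration) -> FrozenSet[int]:
    # Three-point neighbourhood with indices clamped at the image border.
    return frozenset(min(max(j, 0), WIDTH - 1) for j in (i - 1, i, i + 1))


spec = KernelSpec(
    iterations=list(range(WIDTH)),
    precedence=set(),  # no ordering constraints between pixels
    partition=[frozenset(range(0, 4)), frozenset(range(4, 8))],
    reads=stencil_reads,
    writes=lambda i: frozenset({i}),
)

print(spec.has_independent_writes())
print(sorted(spec.reads(0)), sorted(spec.reads(7)))
```

In this sketch the two partition blocks stand in for what a GPU code generator might map to two thread blocks; that mapping is a reading of "preferably executed on a single processing element", not something the quoted statement spells out.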

“…As is common in GPU compute programming, we assume that code has already been written or generated [23,24,25] and that parameters like size of thread blocks, size of thread groups and other GPU-specific parameters are optimized based on the specific architecture of the GPU.…”

Section: Related Work and Our Contributions (mentioning)

confidence: 99%

Saving energy without defying deadlines on mobile GPU-based heterogeneous systems

Maghazeh

Bordoloi

Horga

et al. 2014

Proceedings of the 2014 International Conference on Hardware/Software Codesign and System Synthesis

With the advent of low-power programmable compute cores based on GPUs, GPU-equipped heterogeneous platforms are becoming common in a wide spectrum of industries including safety-critical domains like the automotive industry. While the suitability of GPUs for throughput oriented applications is well-accepted, their applicability for real-time applications remains an open issue. Moreover, in mobile/embedded systems, energy-efficient computing is a major concern and yet, there has been no systematic study on the energy savings that GPUs may potentially provide. In this paper, we propose an approach to utilize both the GPU and the CPU in a heterogeneous fashion to meet the deadlines of a real-time application while ensuring that we maximize the energy savings. We note that GPUs are inherently built to maximize the throughput and this poses a major challenge when deadlines must be satisfied. The problem becomes more acute when we consider the fact that GPUs are more energy efficient than CPUs and thus, a naive approach that is based on maximizing GPU utilization might easily lead to infeasible solutions from a deadline perspective.

“…In this paper, we use and extend the HIPAcc framework to design image processing kernels [2], [3]. The framework uses a source-to-source compiler based on Clang [7] in order to generate low-level, optimized CUDA and OpenCL code for execution on GPU accelerators.…”

Section: Image Processing Framework (mentioning)

confidence: 99%

“…While we previously presented the description and mapping of point and local operators [2], [3], the notation for global reduction operators is introduced in this paper. Global operators are memory bound and, thus, are first class candidates to investigate the proposed techniques.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic Optimization of In-Flight Memory Transactions for GPU Accelerators Based on a Domain-Specific Language for Medical Imaging

Membarth

Hannig

Teich

et al. 2012

2012 11th International Symposium on Parallel and Distributed Computing

Self Cite

An efficient memory bandwidth utilization for GPU accelerators is crucial for memory-bound applications. In medical imaging, the performance of many kernels is limited by the available memory bandwidth since only a few operations are performed per pixel. For such kernels only a fraction of the compute power provided by GPU accelerators can be exploited and performance is predetermined by memory bandwidth. As a remedy, this paper investigates the optimal utilization of available memory bandwidth by means of increasing in-flight memory transactions. Instead of doing this manually for different GPU accelerators, the required CUDA and OpenCL code is automatically generated from descriptions in a Domain-Specific Language (DSL) for the considered application domain. Moreover, the DSL is extended to also support global reduction operators. We show that the generated target-specific code improves bandwidth utilization for memory-bound kernels significantly. Moreover, competitive performance compared to the GPU back end of the widely used image processing library OpenCV can be achieved.
