2013
DOI: 10.1145/2400682.2400713
Polyhedral parallel code generation for CUDA

Abstract: This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG is distinguished by its ability to accelerate computations from any static control loop nest, generating multiple CUDA kernels when necessary. We introduce a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency accord…

Cited by 269 publications (153 citation statements); references 37 publications.
“…An annotation approach is described in [6], based on the Platform-Neutral Compute Intermediate Language [4]. This used the code generator in [35] to generate CUDA and OpenCL code for multiple compute platforms.…”
Section: Related Work
confidence: 99%
“…This model has been well studied and numerous source-to-source compilation tools have evolved, such as PluTo [5], PPCG [27], Par4ALL [25], or the ROSE compiler infrastructure [23] with its PolyOpt/C optimizer. These frameworks traditionally aim for an automatic OpenMP and SIMD parallelization of sequential CPU codes; some (e.g., PPCG) are also capable of generating CUDA or OpenCL code for GPUs.…”
Section: B. Parallelization Tools
confidence: 99%
“…Existing tools, such as Par4all, PIPS, and PluTo, are able to parallelize sequential program parts under certain conditions [3], [6], [25], [27]. For instance, PluTo is capable of transforming a nested loop if it is polyhedral, i.e., all array accesses within the loop are affine functions of the loop iterators (for details see Section III-0c).…”
Section: Introduction
confidence: 99%
“…The most important thing is to organize the available resources of the GPU properly. When the GPU resources are well organized, the CPU can launch a kernel function on the GPU to start computing [32].…”
Section: Some Optimization Principles for GPU Programming
confidence: 99%
“…For threads in different blocks, communication must go through global memory, which slows down the calculation. The general optimization principles of GPU programming can be summarized as [23,32]:

- More threads are better, so as to hide the memory access latency;
- Avoid accesses to global memory;
- Try to organize threads within one block; note that the number of threads within one block should be an integer multiple of the number within one warp;
- Try to reduce communication between device and host to avoid long delays.…”
Section: Some Optimization Principles for GPU Programming
confidence: 99%