MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

Stratton, John A.; Stone, Sam S.; Hwu, Wen-mei W.

doi:10.1007/978-3-540-89740-8_2

Cited by 159 publications

(107 citation statements)

References 11 publications

Citation statements by type: Supporting, Mentioning (107), Contrasting

“…Another project, MCUDA [16], applied code transformations to CUDA kernels, enabling them to run efficiently on multicore CPUs. Unfortunately for legacy code maintainers, the reverse operation, porting multicore code to GPUs, proved difficult [14].…”

Section: Related Work (mentioning)

confidence: 99%

From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming

et al. 2012

In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers' optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using search harness.

“…Ravi et al [29] rely on the molding technique (changing the dimensions of grid and thread blocks while preserving the correctness of the computation), when possible. Pai et al [27] propose a similar technique and associated code transformation based on iterative wrapping [35] that produces an elastic kernel. These techniques rely on developer or compiler transformation to prepare the programs for concurrent execution.…”

Section: Related Work (mentioning)

confidence: 99%

Enabling preemptive multiprogramming on GPUs

Tanasic, Gelado, Cabezas, et al. 2014

SIGARCH Comput. Archit. News

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend the NVIDIA GK110 (Kepler) like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve execution time of high-priority processes by 15.6x, the average application turnaround time by between 1.5x and 2x, and system fairness by up to 3.4x.
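The "molding" idea quoted in the citation statement above (changing grid and thread-block dimensions while preserving the result) can be illustrated with a small serial sketch in C. This is not code from Ravi et al., Pai et al., or MCUDA; the names `saxpy_body`, `elastic_thread`, and `launch_molded` are hypothetical, and the sketch assumes a kernel whose threads are independent (no barriers, no dependence on the block shape). Each physical thread loops over the logical thread indices it is responsible for, so the number of physical threads can change without changing the output.

```c
#include <assert.h>

/* Original per-thread work, written against a logical global thread index.
 * (Hypothetical example kernel body: y = a * x + y.) */
void saxpy_body(int logical_tid, float a, const float *x, float *y) {
    y[logical_tid] = a * x[logical_tid] + y[logical_tid];
}

/* One physical thread of the molded launch. It covers every logical thread
 * whose index equals physical_tid modulo physical_threads. */
void elastic_thread(int physical_tid, int physical_threads,
                    int logical_threads, float a,
                    const float *x, float *y) {
    assert(physical_threads > 0);
    assert(physical_tid >= 0 && physical_tid < physical_threads);
    for (int t = physical_tid; t < logical_threads; t += physical_threads) {
        saxpy_body(t, a, x, y);
    }
}

/* Serial stand-in for launching the kernel with a chosen number of
 * physical threads; a real GPU would run these iterations concurrently. */
void launch_molded(int physical_threads, int logical_threads, float a,
                   const float *x, float *y) {
    for (int p = 0; p < physical_threads; ++p) {
        elastic_thread(p, physical_threads, logical_threads, a, x, y);
    }
}
```

Calling `launch_molded` with 3 or with 16 physical threads over the same 10 logical threads should produce the same `y`, which is the property a scheduler would rely on when resizing a kernel.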

“…break, continue and return). A loop-fission technique proposed in [10] is used to break the kernel-wide thread-loop into localized thread-loops which do not cross any of the synchronization directives encountered in the code. Fig.…”

Section: FCUDA Front-end Transformation (mentioning)

confidence: 99%

“…This way serialized execution of threads maintains the thread-block synchronization semantics. FCUDA extends the MCUDA [10] implementation of loop-fission by adding COMPUTE and TRANSFER pragmas to the list of synchronization directives. COMPUTE and TRANSFER pragmas are used by the FPGA programmer to annotate computation and off-chip data communication tasks.…”

Section: FCUDA Front-end Transformation (mentioning)

confidence: 99%

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

Papakonstantinou, Gururaj, Stratton, et al. 2009

2009 IEEE 7th Symposium on Application Specific Processors

Self Cite

As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is often not a push-button task. Often the programmer has to expose the application's fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPGPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
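The loop-fission step described in the FCUDA citation statements above can be sketched in plain C. This is an illustrative reconstruction under stated assumptions, not the output of MCUDA or FCUDA: the kernel (reversing one block through shared memory), the function name `reverse_block_fissioned`, and the constant `MAX_BLOCK_DIM` are all hypothetical. The point it shows is that a single thread loop wrapped around the whole kernel body would let thread 0 read shared data that later threads have not yet written; splitting the loop at the barrier restores the synchronization semantics under serialized execution.

```c
#include <assert.h>

#define MAX_BLOCK_DIM 64  /* hypothetical upper bound on threads per block */

/* CUDA-style kernel being emulated for one thread block:
 *
 *     shared[tid] = in[tid];
 *     __syncthreads();
 *     out[tid] = shared[block_dim - 1 - tid];
 *
 * The barrier separates the body into two regions. Each region gets its
 * own thread loop, so every "thread" finishes region 1 before any
 * "thread" starts region 2. */
void reverse_block_fissioned(int block_dim, const int *in, int *out) {
    int shared[MAX_BLOCK_DIM];  /* stands in for __shared__ storage */
    assert(block_dim > 0 && block_dim <= MAX_BLOCK_DIM);

    /* Thread loop 1: statements before the barrier. */
    for (int tid = 0; tid < block_dim; ++tid) {
        shared[tid] = in[tid];
    }

    /* The barrier itself disappears: it is the boundary between loops. */

    /* Thread loop 2: statements after the barrier. */
    for (int tid = 0; tid < block_dim; ++tid) {
        out[tid] = shared[block_dim - 1 - tid];
    }
}
```

Per the quoted statement, FCUDA treats its COMPUTE and TRANSFER pragmas as additional split points of the same kind; in this sketch that would correspond to adding further loop boundaries.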
