2011
DOI: 10.1007/978-3-642-19328-6_10

Accelerating GPU Kernels for Dense Linear Algebra

Abstract: Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting, a set of GPU-specific optimization techniques, allows us to easily remove performance oscillations associated with problem dimensions not di…
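The pointer-redirecting idea named in the abstract can be illustrated with a small kernel sketch: threads whose indices fall past the matrix boundary have their reads redirected to the last valid element, so memory accesses stay uniform without boundary branches or padding. The kernel, names, and launch shape below are illustrative assumptions, not the MAGMA implementation.

```cuda
// Minimal sketch of pointer redirecting, assuming a column-major matrix A
// with leading dimension lda. Threads that would read past the M x N
// boundary are redirected to the last valid row/column instead of
// branching around the load or padding the matrix.
#include <cuda_runtime.h>

#define BLK 16

__global__ void copy_tile_redirect(const float *A, float *B,
                                   int M, int N, int lda)
{
    // Global row/column this thread would normally touch.
    int row = blockIdx.y * BLK + threadIdx.y;
    int col = blockIdx.x * BLK + threadIdx.x;

    // Pointer redirecting: clamp out-of-range indices back into the
    // matrix so every thread issues a valid load and the inner access
    // pattern stays free of boundary branches.
    int r = (row < M) ? row : M - 1;
    int c = (col < N) ? col : N - 1;

    float v = A[c * lda + r];          // always in bounds

    // Only threads that map to real elements write their result back.
    if (row < M && col < N)
        B[c * lda + r] = v;
}

// Example launch (host side):
//   dim3 block(BLK, BLK);
//   dim3 grid((N + BLK - 1) / BLK, (M + BLK - 1) / BLK);
//   copy_tile_redirect<<<grid, block>>>(dA, dB, M, N, lda);
```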

Cited by 39 publications (29 citation statements)
References 3 publications
“…Autotuning has been used intensively on CPUs in the past to address these challenges to automatically generate near optimal numerical libraries, e.g., ATLAS [18,19] and PHiPAC [20] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even highly tuned hand-written kernels. The challenge of providing performance portability is by no means limited to linear algebra.…”
Section: Related Work
confidence: 99%
“…Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has already shown that the technique is a very practical solution to easily port existing algorithmic solutions to quickly evolving GPU architectures and to substantially speed up even hand-tuned kernels. We expand this early work, as described below, in the context of today's high-end GPGPUs from NVIDIA and ATI, using both CUDA and OpenCL.…”
Section: Performance Portability With Auto-tuning
confidence: 99%
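The auto-tuning that these excerpts describe reduces to an empirical search: run the same kernel under several candidate launch configurations, time each, and keep the fastest. The sketch below shows that loop for a placeholder kernel; the kernel body and the candidate block sizes are assumptions for illustration, not the tuned MAGMA kernels.

```cuda
// Hedged sketch of an empirical auto-tuning sweep over block sizes,
// timed with CUDA events. The kernel is a stand-in for the real
// compute kernel being tuned.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // placeholder kernel body
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    const int candidates[] = {64, 128, 256, 512};
    int   best_block = 0;
    float best_ms    = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the first timed candidate is not penalized.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        scale_kernel<<<grid, block>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }

    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```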
“…Similar efforts followed in the MAGMA project [25]. The introduction of the NVIDIA Fermi architecture triggered the development of MAGMA GEMM kernels tuned for that architecture [29], [28]. Although tuning was an important part of this work, it was accomplished through exhaustive experimentation rather than a systematic autotuning effort.…”
Section: Related Work
confidence: 99%
“…Because several generic optimization techniques, including pointer redirecting [19] and auto-tuning [20], were introduced, the performance of their kernel is a significant improvement over CUBLAS 2.3. In this work they also presented a kernel for transposed matrix-vector multiplication which, like Fujimoto's kernel, allows groups of threads within a block to work together, followed by a required reduction operation.…”
Section: Related Work
confidence: 99%
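The kernel pattern described in the last excerpt, groups of threads cooperating on a transposed matrix-vector product and then combining their partial sums with a reduction, can be sketched as follows. The kernel name, block size, and one-block-per-column mapping are illustrative assumptions rather than the cited implementation.

```cuda
// Hedged sketch of y = A^T * x for a column-major m x n matrix A:
// each thread block handles one column of A, its threads accumulate
// partial dot products, and a shared-memory tree reduction combines them.
#include <cuda_runtime.h>

#define NTHREADS 128   // power of two, so the tree reduction below works

__global__ void gemv_transposed(const float *A, const float *x, float *y,
                                int m, int n, int lda)
{
    __shared__ float partial[NTHREADS];

    int col = blockIdx.x;            // one block per column of A
    int tid = threadIdx.x;

    // Each thread strides over the rows of this column.
    float sum = 0.0f;
    for (int row = tid; row < m; row += NTHREADS)
        sum += A[col * lda + row] * x[row];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = NTHREADS / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        y[col] = partial[0];         // y has length n for y = A^T x
}

// Example launch (host side):
//   gemv_transposed<<<n, NTHREADS>>>(dA, dx, dy, m, n, lda);
```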