2009
DOI: 10.1007/978-3-642-01970-8_89

A Note on Auto-tuning GEMM for GPUs

Abstract: The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13,11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is dif…

Cited by 138 publications (91 citation statements)
References 10 publications
“…Autotuning has been used intensively on CPUs in the past to address these challenges to automatically generate near optimal numerical libraries, e.g., ATLAS [18,19] and PHiPAC [20] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even highly tuned hand-written kernels. The challenge of providing performance portability is by no means limited to linear algebra.…”
Section: Related Work
confidence: 99%
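The statement above refers to auto-tuning CUDA kernels by exposing their implementation choices as searchable parameters. A minimal sketch of that idea is shown below; it is not the paper's code, and the kernel name, tile sizes, and problem size are illustrative assumptions. The tile size of a shared-memory matrix-multiply kernel is made a compile-time template parameter so that an auto-tuner can instantiate and compare several variants.

```cuda
// Hypothetical sketch: a tiled single-precision matrix multiply whose tile
// size is a compile-time tuning parameter. An auto-tuner would instantiate
// the template for several TILE values and keep the fastest variant.
#include <cstdio>
#include <cuda_runtime.h>

template <int TILE>
__global__ void sgemm_tiled(int n, const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    // n is assumed to be a multiple of TILE in this sketch.
    for (int t = 0; t < n; t += TILE) {
        // Stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;  // placeholder problem size
    float *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));
    cudaMemset(A, 0, n * n * sizeof(float));
    cudaMemset(B, 0, n * n * sizeof(float));

    // Two candidate tile sizes; a real auto-tuner would sweep many more
    // parameters (tile shape, unrolling, register blocking, ...).
    dim3 block8(8, 8), grid8(n / 8, n / 8);
    sgemm_tiled<8><<<grid8, block8>>>(n, A, B, C);
    dim3 block16(16, 16), grid16(n / 16, n / 16);
    sgemm_tiled<16><<<grid16, block16>>>(n, A, B, C);
    cudaDeviceSynchronize();
    printf("launched both candidate variants\n");
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```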
“…Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has already shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even hand-tuned kernels. We expand this early work, as described below, in the context of today's high-end GPGPUs from NVIDIA and ATI, using both CUDA and OpenCL.…”
Section: Performance Portability With Auto-tuning
confidence: 99%
“…Two major auto-tuning approaches have emerged in the extensive literature covering the subject (see surveys in e.g. [Vuduc et al., 2001, Williams, 2008, Datta et al., 2008, Cavazos, 2008, Li et al., 2009, Park et al., 2011]): analytical model-driven optimization and empirical optimization [Yotov et al., 2003].…”
Section: Auto-tuning
confidence: 99%
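To illustrate the empirical branch of that distinction, the sketch below runs every candidate configuration on the actual device, times it, and keeps the fastest; this is a hedged illustration only, and the kernel, the parameter grid (thread-block sizes), and the problem size are placeholder assumptions rather than code from any of the cited works.

```cuda
// Hypothetical sketch of empirical auto-tuning: benchmark each candidate
// configuration on the target hardware and select the fastest one.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;  // placeholder problem size
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512};  // thread-block sizes to try
    int best = 0;
    float bestMs = 1e30f;
    for (int threads : candidates) {
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(x, n, 1.0f);   // warm-up launch
        cudaEventRecord(start);
        scale<<<blocks, threads>>>(x, n, 1.0f);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", threads, ms);
        if (ms < bestMs) { bestMs = ms; best = threads; }
    }
    printf("best block size: %d\n", best);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```

The analytical model-driven approach would instead predict the best configuration from a performance model (e.g., occupancy and memory-traffic estimates) without running each variant, trading search time for model accuracy.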
“…Auto-tuning techniques have been successfully used to automatically generate the kernels of various high-performance libraries such as ATLAS [4], FFTW, OSKI or SPIRAL; and similar results are obtained in the context of GPU computing by the MAGMA project [9]. While performance models permit to generate efficient computational kernels even on heterogeneous systems, computations are usually mapped statically on the different processing resources when dealing with hybrid systems [11].…”
Section: Related Work
confidence: 95%