2009
DOI: 10.1007/978-3-642-01970-8_89

A Note on Auto-tuning GEMM for GPUs

Abstract: The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13,11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is dif…

Cited by 138 publications (91 citation statements)
References 10 publications
“…Autotuning has been used intensively on CPUs in the past to address these challenges to automatically generate near optimal numerical libraries, e.g., ATLAS [18,19] and PHiPAC [20] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even highly tuned hand-written kernels. The challenge of providing performance portability is by no means limited to linear algebra.…”
Section: Related Work
confidence: 99%
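The statement above refers to auto-tuning CUDA kernels by exposing their implementation choices as searchable parameters. A minimal sketch of that idea is shown below; it is not the paper's code, and the kernel name, tile sizes, and problem size are illustrative assumptions. The tile size of a shared-memory matrix-multiply kernel is made a compile-time template parameter so that an auto-tuner can instantiate and compare several variants.

```cuda
// Hypothetical sketch: a tiled single-precision matrix multiply whose tile
// size is a compile-time tuning parameter. An auto-tuner would instantiate
// the template for several TILE values and keep the fastest variant.
#include <cstdio>
#include <cuda_runtime.h>

template <int TILE>
__global__ void sgemm_tiled(int n, const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    // n is assumed to be a multiple of TILE in this sketch.
    for (int t = 0; t < n; t += TILE) {
        // Stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;  // placeholder problem size
    float *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));
    cudaMemset(A, 0, n * n * sizeof(float));
    cudaMemset(B, 0, n * n * sizeof(float));

    // Two candidate tile sizes; a real auto-tuner would sweep many more
    // parameters (tile shape, unrolling, register blocking, ...).
    dim3 block8(8, 8), grid8(n / 8, n / 8);
    sgemm_tiled<8><<<grid8, block8>>>(n, A, B, C);
    dim3 block16(16, 16), grid16(n / 16, n / 16);
    sgemm_tiled<16><<<grid16, block16>>>(n, A, B, C);
    cudaDeviceSynchronize();
    printf("launched both candidate variants\n");
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```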
“…Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has already shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even hand-tuned kernels. We expand this early work, as described below, in the context of today's high-end GPGPUs from NVIDIA and ATI, using both CUDA and OpenCL.…”
Section: Performance Portability With Auto-tuning
confidence: 99%
“…Two major auto-tuning approaches have emerged in the extensive literature covering the subject (see surveys in e.g. [Vuduc et al., 2001, Williams, 2008, Datta et al., 2008, Cavazos, 2008, Li et al., 2009, Park et al., 2011]): analytical model-driven optimization and empirical optimization [Yotov et al., 2003].…”
Section: Auto-tuning
confidence: 99%
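To illustrate the empirical branch of that distinction, the sketch below runs every candidate configuration on the actual device, times it, and keeps the fastest; this is a hedged illustration only, and the kernel, the parameter grid (thread-block sizes), and the problem size are placeholder assumptions rather than code from any of the cited works.

```cuda
// Hypothetical sketch of empirical auto-tuning: benchmark each candidate
// configuration on the target hardware and select the fastest one.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;  // placeholder problem size
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512};  // thread-block sizes to try
    int best = 0;
    float bestMs = 1e30f;
    for (int threads : candidates) {
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(x, n, 1.0f);   // warm-up launch
        cudaEventRecord(start);
        scale<<<blocks, threads>>>(x, n, 1.0f);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", threads, ms);
        if (ms < bestMs) { bestMs = ms; best = threads; }
    }
    printf("best block size: %d\n", best);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```

The analytical model-driven approach would instead predict the best configuration from a performance model (e.g., occupancy and memory-traffic estimates) without running each variant, trading search time for model accuracy.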
“…Auto-tuning techniques have been successfully used to automatically generate the kernels of various high-performance libraries such as ATLAS [4], FFTW, OSKI or SPIRAL; and similar results are obtained in the context of GPU computing by the MAGMA project [9]. While performance models permit to generate efficient computational kernels even on heterogeneous systems, computations are usually mapped statically on the different processing resources when dealing with hybrid systems [11].…”
Section: Related Work
confidence: 95%