2011
DOI: 10.1007/978-3-642-19328-6_10

Accelerating GPU Kernels for Dense Linear Algebra

Abstract: Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting, a set of GPU-specific optimization techniques, allows us to easily remove performance oscillations associated with problem dimensions not di…
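The pointer-redirecting idea named in the abstract can be illustrated with a small kernel sketch: threads whose indices fall past the matrix boundary have their reads redirected to the last valid element, so memory accesses stay uniform without boundary branches or padding. The kernel, names, and launch shape below are illustrative assumptions, not the MAGMA implementation.

```cuda
// Minimal sketch of pointer redirecting, assuming a column-major matrix A
// with leading dimension lda. Threads that would read past the M x N
// boundary are redirected to the last valid row/column instead of
// branching around the load or padding the matrix.
#include <cuda_runtime.h>

#define BLK 16

__global__ void copy_tile_redirect(const float *A, float *B,
                                   int M, int N, int lda)
{
    // Global row/column this thread would normally touch.
    int row = blockIdx.y * BLK + threadIdx.y;
    int col = blockIdx.x * BLK + threadIdx.x;

    // Pointer redirecting: clamp out-of-range indices back into the
    // matrix so every thread issues a valid load and the inner access
    // pattern stays free of boundary branches.
    int r = (row < M) ? row : M - 1;
    int c = (col < N) ? col : N - 1;

    float v = A[c * lda + r];          // always in bounds

    // Only threads that map to real elements write their result back.
    if (row < M && col < N)
        B[c * lda + r] = v;
}

// Example launch (host side):
//   dim3 block(BLK, BLK);
//   dim3 grid((N + BLK - 1) / BLK, (M + BLK - 1) / BLK);
//   copy_tile_redirect<<<grid, block>>>(dA, dB, M, N, lda);
```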

Cited by 39 publications (29 citation statements)
References 3 publications
“…Autotuning has been used intensively on CPUs in the past to address these challenges to automatically generate near optimal numerical libraries, e.g., ATLAS [18,19] and PHiPAC [20] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even highly tuned hand-written kernels. The challenge of providing performance portability is by no means limited to linear algebra.…”
Section: Related Work
confidence: 99%
“…Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has already shown that the technique is a very practical solution to easily port existing algorithmic solutions to quickly evolving GPU architectures and to substantially speed up even hand-tuned kernels. We expand this early work, as described below, in the context of today's high-end GPGPUs from NVIDIA and ATI, using both CUDA and OpenCL.…”
Section: Performance Portability With Auto-tuning
confidence: 99%
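The auto-tuning that these excerpts describe reduces to an empirical search: run the same kernel under several candidate launch configurations, time each, and keep the fastest. The sketch below shows that loop for a placeholder kernel; the kernel body and the candidate block sizes are assumptions for illustration, not the tuned MAGMA kernels.

```cuda
// Hedged sketch of an empirical auto-tuning sweep over block sizes,
// timed with CUDA events. The kernel is a stand-in for the real
// compute kernel being tuned.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // placeholder kernel body
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    const int candidates[] = {64, 128, 256, 512};
    int   best_block = 0;
    float best_ms    = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the first timed candidate is not penalized.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        scale_kernel<<<grid, block>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }

    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```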
“…Similar efforts followed in the MAGMA project [25]. The introduction of the NVIDIA Fermi architecture triggered the development of MAGMA GEMM kernels tuned for that architecture [29], [28]. Although tuning was an important part of this work, it was accomplished through exhaustive experimentation rather than a systematic autotuning effort.…”
Section: Related Work
confidence: 99%
“…Because several generic optimization techniques, including pointer redirecting [19] and auto-tuning [20], were introduced, the performance of their kernel is a significant improvement over CUBLAS 2.3. In this work they also presented a kernel for transposed matrix-vector multiplication which, like Fujimoto's kernel, allows groups of threads within a block to work together, followed by a required reduction operation.…”
Section: Related Work
confidence: 99%
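The kernel pattern described in the last excerpt, groups of threads cooperating on a transposed matrix-vector product and then combining their partial sums with a reduction, can be sketched as follows. The kernel name, block size, and one-block-per-column mapping are illustrative assumptions rather than the cited implementation.

```cuda
// Hedged sketch of y = A^T * x for a column-major m x n matrix A:
// each thread block handles one column of A, its threads accumulate
// partial dot products, and a shared-memory tree reduction combines them.
#include <cuda_runtime.h>

#define NTHREADS 128   // power of two, so the tree reduction below works

__global__ void gemv_transposed(const float *A, const float *x, float *y,
                                int m, int n, int lda)
{
    __shared__ float partial[NTHREADS];

    int col = blockIdx.x;            // one block per column of A
    int tid = threadIdx.x;

    // Each thread strides over the rows of this column.
    float sum = 0.0f;
    for (int row = tid; row < m; row += NTHREADS)
        sum += A[col * lda + row] * x[row];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = NTHREADS / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        y[col] = partial[0];         // y has length n for y = A^T x
}

// Example launch (host side):
//   gemv_transposed<<<n, NTHREADS>>>(dA, dx, dy, m, n, lda);
```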