2011 IEEE International Parallel & Distributed Processing Symposium 2011
DOI: 10.1109/ipdps.2011.33

Automatic Library Generation for BLAS3 on GPUs

Abstract: High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune libraries to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling th…

Cited by 25 publications (19 citation statements)
References 33 publications
“…Cui et al. [11] presented a similar system built using the Open64 compiler [36] and the WRaP-IT/URUK/URGenT polyhedral toolchain [10]. Here the authors started with optimized MAGMA/CUBLAS Fermi SGEMM kernels (A × B; Aᵀ × B; A × Bᵀ; Aᵀ × Bᵀ) and used automatic code transformations to extrapolate the SGEMM performance to the other three Level 3 BLAS kernels (STRMM, STRSM, SSYMM) with all combinations of inputs covered (left/right, lower/upper).…”
Section: Related Work (mentioning)
confidence: 99%
“…A number of general-purpose source-to-source compilers [6,9,8,10,17] can also generate highly efficient low-level C code for various DLA routines. These systems typically use empirical tuning to automatically experiment with different optimization choices and select those that perform the best.…”
Section: Related Work (mentioning)
confidence: 99%
“…While having similarities with our approach, it mostly considers pre-optimized codes and relies on optimization hints provided by the programmer. Cui et al. [4] also solicit the programmers to provide hints on code portions that present similarities in their performance characteristics. Compared to these approaches, we try to automatize the runtime selection.…”
Section: Related Work (mentioning)
confidence: 99%