Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit 2016
DOI: 10.1145/2884045.2884046
Performance portable GPU code generation for matrix multiplication

Abstract: Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to attempt to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device. Achieving performance portability is the…

Cited by 20 publications (13 citation statements)
References 21 publications
“…Our exploration process is divided into two phases: (1) Rewriting, and (2) Auto-tuning. For this evaluation, our existing rewriting strategy [42] was used without making any adjustments. In the first phase, a derivation tree was created by applying multiple potentially applicable rules, each creating a separate branch.…”
Section: Methods
confidence: 99%
“…This design makes it easy to extend and add new optimizations into the compiler, whereas in Delite optimizations are hard-coded for each backend. A more detailed discussion about this process can be found in our previous work [42].…”
Section: The Real Challenge: Universal High Performance Code Generation
confidence: 99%
“…Introducing Partition via Rewriting Rules One of the core ideas underpinning Lift is the use of an automated exploration system that uses rewriting rules to automatically generate high performance code. A rewrite rule is a semantics-preserving transformation of expressions, and is Lift's way of expressing optimization choices that are automatically explored in the optimization process using stochastic methods, as explained by [15].…”
Section: Partition
confidence: 99%
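The statement above describes rewrite rules as semantics-preserving transformations of expressions. A minimal sketch of that idea, in the spirit of a map-fusion rule (map(f) after map(g) rewrites to map(f after g)); the names `Map`, `Comp`, and `fuse_maps` are hypothetical illustrations, not Lift's actual API:

```python
# Hypothetical mini expression language with one rewrite rule (map fusion).
from dataclasses import dataclass

@dataclass(frozen=True)
class Map:          # apply a function element-wise over a collection
    f: object

@dataclass(frozen=True)
class Comp:         # composition: apply `right` first, then `left`
    left: object
    right: object

def fuse_maps(expr):
    """Rewrite Comp(Map(f), Map(g)) into Map(Comp(f, g)); otherwise unchanged."""
    if isinstance(expr, Comp) and isinstance(expr.left, Map) and isinstance(expr.right, Map):
        return Map(Comp(expr.left.f, expr.right.f))
    return expr  # rule does not apply at this node

def apply_fn(f, x):
    """Apply a (possibly composed) scalar function."""
    if isinstance(f, Comp):
        return apply_fn(f.left, apply_fn(f.right, x))
    return f(x)

def evaluate(expr, xs):
    """Interpret an expression over a list, to check semantics preservation."""
    if isinstance(expr, Map):
        return [apply_fn(expr.f, x) for x in xs]
    if isinstance(expr, Comp):
        return evaluate(expr.left, evaluate(expr.right, xs))
    raise TypeError(expr)

inc = lambda x: x + 1
dbl = lambda x: x * 2
e = Comp(Map(inc), Map(dbl))  # map(inc) after map(dbl)

# The rewrite changes the program's structure but not its meaning:
assert evaluate(e, [1, 2, 3]) == evaluate(fuse_maps(e), [1, 2, 3])  # both [3, 5, 7]
```

Because both sides of a rule compute the same result, an exploration system can apply rules freely, branching on each applicable rule, and every leaf of the resulting derivation tree is still a correct program.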
“…To explore different algorithmic optimization choices, we encode the optimizations discussed in section 5.3 plus 1D and 2D register blocking, and the tiling presented by others [22]. Starting from the high-level expression in Listing 1, we apply these rewrite rules at all valid locations in an arbitrary order.…”
Section: Automatic Exploration
confidence: 99%
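The statement above mentions tiling and 2D register blocking for matrix multiplication. A minimal sketch of the loop structure these optimizations produce, written as plain Python loops so the nesting is visible; the tile and block sizes (`TILE`, `RB`) are illustrative choices, not values from the cited work, and `n` is assumed divisible by `TILE`, with `TILE` divisible by `RB`:

```python
TILE = 4   # tile edge length, modeling the cache / local-memory level
RB = 2     # 2D register block: each "thread" computes an RB x RB patch of C

def matmul_tiled(A, B, n):
    """Compute C = A @ B for n x n lists-of-lists, with tiling + register blocking."""
    C = [[0.0] * n for _ in range(n)]
    for it in range(0, n, TILE):                      # tile loops over C
        for jt in range(0, n, TILE):
            for kt in range(0, n, TILE):              # tile loop over the reduction
                for ib in range(it, it + TILE, RB):   # register-block loops
                    for jb in range(jt, jt + TILE, RB):
                        # acc models an RB x RB block of C held in registers
                        acc = [[C[ib + i][jb + j] for j in range(RB)]
                               for i in range(RB)]
                        for k in range(kt, kt + TILE):
                            for i in range(RB):
                                a = A[ib + i][k]      # reused across the j loop
                                for j in range(RB):
                                    acc[i][j] += a * B[k][jb + j]
                        # write the accumulated block back
                        for i in range(RB):
                            for j in range(RB):
                                C[ib + i][jb + j] = acc[i][j]
    return C
```

Tiling improves data reuse at the cache or local-memory level, while the register block lets each element of `A` loaded into `a` be reused `RB` times before the next load, reducing memory traffic per multiply-add.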
“…These rules encode algorithmic transformations as well as hardware-specific low-level optimizations. Recent work [22] has shown that this generic compiler approach leads to high performance for desktop-class GPUs from AMD and Nvidia.…”
Section: Introduction
confidence: 99%