Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques 2020
DOI: 10.1145/3410463.3414632
Fireiron

Abstract: High GPU performance can only be achieved if a kernel efficiently uses the multi-layered compute and memory hierarchies. For example, accelerators such as NVIDIA's Tensor Cores require specific mappings of threads to data that must be considered in data movements to and from registers. Current compilers struggle to match the performance of vendor libraries like cuBLAS, which are developed by experts in assembly. This manual low-level coding is time-consuming and makes it difficult to unlock the full GPU potential, pr…
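To illustrate the abstract's point about thread-to-data mappings: NVIDIA's WMMA API makes Tensor Core operations warp-cooperative, meaning individual threads cannot address matrix elements directly and the data layout inside a fragment is opaque. A minimal sketch (not code from Fireiron itself; tile size 16x16x16 and row/column layouts are illustrative choices):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 half-precision tile pair into a float
// accumulator. Fragments are distributed across the warp's 32 threads
// in a hardware-defined layout, so loads and stores must go through
// the *_sync intrinsics rather than per-thread indexing.
__global__ void tile_mma(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);  // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

This opaque fragment layout is exactly why data movements to and from registers must be co-designed with the compute mapping, as the abstract notes.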

Cited by 14 publications
References 19 publications