2018
DOI: 10.1145/3235029
High-Performance Generalized Tensor Operations

Abstract: The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be gen…

Cited by 26 publications (33 citation statements) | References 44 publications
“…Generalized matrix multiplication (the GEMM BLAS kernel) is one of the most important computation patterns and historically the most heavily optimized kernel [31]. However, state-of-the-art compilers achieve only a fraction of the theoretical machine performance for a simple textbook-style implementation [15]. A recent improvement within Polly introduced a custom transformation for GEMM-like kernels that is controlled outside of the main affine scheduling mechanism [15].…”
Section: Hand-tuned GEMM-like Optimization (mentioning)
confidence: 99%
“…However, state-of-the-art compilers achieve only a fraction of the theoretical machine performance for a simple textbook-style implementation [15]. A recent improvement within Polly introduced a custom transformation for GEMM-like kernels that is controlled outside of the main affine scheduling mechanism [15]. This transformation applies to a generalized case of tensor contraction of the form…”
Section: Hand-tuned GEMM-like Optimization (mentioning)
confidence: 99%
“…Only the copy-in and copy-out to the local memory allocation modify the schedule tree; these are new statements inserted before and after, respectively, the code that uses them. Parts of the code already existed in Polly as part of its matrix-matrix multiplication optimization [30], but had to be generalized to arbitrary loops. To define an index function and size of the packed array, we use the bounding box technique from [31].…”
Section: F. Polly as Loop-Transformer (mentioning)
confidence: 99%