Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis 2009
DOI: 10.1145/1654059.1654119

Automating the generation of composed linear algebra kernels

Abstract: Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed…
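
To make the bandwidth argument concrete, here is a minimal C sketch of the kind of composition the abstract describes (an illustration of the general idea under assumed names, not code produced by the paper's generator): computing z = A*x + y as separate GEMV and AXPY calls forces the intermediate vector through memory twice, while a fused loop consumes each intermediate element as soon as it is produced.

```c
/* Hypothetical illustration of cross-call fusion; the matvec_add_*
 * names are made up. Row-major A is m x n; cblas_* from <cblas.h>. */
#include <cblas.h>
#include <stddef.h>
#include <string.h>

/* Unfused baseline: t = A*x, then z = y + t. The intermediate t is
 * written to memory by dgemv and read back by daxpy. */
void matvec_add_unfused(size_t m, size_t n, const double *A,
                        const double *x, const double *y,
                        double *t, double *z)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, (int)m, (int)n,
                1.0, A, (int)n, x, 1, 0.0, t, 1);  /* t = A*x */
    memcpy(z, y, m * sizeof *z);                   /* z = y   */
    cblas_daxpy((int)m, 1.0, t, 1, z, 1);          /* z += t  */
}

/* Fused kernel: each element of A*x stays in a register and is
 * combined with y immediately, so no intermediate vector exists. */
void matvec_add_fused(size_t m, size_t n, const double *A,
                      const double *x, const double *y, double *z)
{
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        z[i] = acc + y[i];
    }
}
```

For bandwidth-bound sizes the fused version reads A, x, and y once and writes z once; this cross-call saving is exactly what per-routine tuning cannot see, which is why a generator working over composed specifications can beat individually tuned BLAS calls.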

Cited by 44 publications (65 citation statements) · References 51 publications
Citation statements, 2011–2021, ordered by relevance:
“…Another different work seeks to update BLAS by extending it with additional functionalities [26]. Build-to-order BLAS [27] and Design-by-transformation BLAS [28] approach the problem from a different angle. Their goal is to generate optimized and tuned BLAS-like functions from high level kernel specifications.…”
Section: Related Work (mentioning) · Confidence: 99%
“…Belter et al presented a technique to optimize linear algebra kernels specified in MATLAB syntax using loop fusion [5]. Our approach of forward substitution is reminiscent of their technique of inlining loops and fusing them aggressively.…”
Section: Related Work (mentioning) · Confidence: 99%
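
The inlining-and-fusion technique this statement refers to can be pictured with a short C sketch (a hypothetical example with a made-up kernel, not code from either cited paper): the temporary vector defined by one loop is forward-substituted into the loop that consumes it, so it lives in a register instead of memory.

```c
/* Hypothetical kernel "t = x + y; s = dot(t, t)" before and after
 * forward substitution plus loop fusion. */
#include <stddef.h>

/* Unfused: t is materialized in memory, then re-read. */
double sumsq_unfused(size_t n, const double *x, const double *y,
                     double *t)
{
    for (size_t i = 0; i < n; i++)
        t[i] = x[i] + y[i];
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += t[i] * t[i];
    return s;
}

/* Fused: the definition of t is inlined at its only use, the two
 * loops merge, and the temporary never touches memory. */
double sumsq_fused(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ti = x[i] + y[i];  /* per-element temporary in a register */
        s += ti * ti;
    }
    return s;
}
```
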
“…Many compiler optimizations, e.g., those in Fig 2, have been shown to be highly effective for computations similar to the gemm kernel in Fig 1. However, the performance of the compiler-optimized code is often suboptimal when compared to those attained by manually optimized libraries, e.g., MKL [17], ACML [4], and ATLAS [29], which have been supplied by CPU vendors or HPC researchers, often with selected kernels directly implemented in assembly [8]. Developing highly optimized libraries manually, however, is excessively labor intensive and error prone.…”
Section: Introduction (mentioning) · Confidence: 99%