Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis 2009
DOI: 10.1145/1654059.1654119

Automating the generation of composed linear algebra kernels

Abstract: Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed…
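
To make the bandwidth argument concrete, here is a minimal C sketch of the kind of composition the abstract describes (an illustration of the general idea under assumed names, not code produced by the paper's generator): computing z = A*x + y as separate GEMV and AXPY calls forces the intermediate vector through memory twice, while a fused loop consumes each intermediate element as soon as it is produced.

```c
/* Hypothetical illustration of cross-call fusion; the matvec_add_*
 * names are made up. Row-major A is m x n; cblas_* from <cblas.h>. */
#include <cblas.h>
#include <stddef.h>
#include <string.h>

/* Unfused baseline: t = A*x, then z = y + t. The intermediate t is
 * written to memory by dgemv and read back by daxpy. */
void matvec_add_unfused(size_t m, size_t n, const double *A,
                        const double *x, const double *y,
                        double *t, double *z)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, (int)m, (int)n,
                1.0, A, (int)n, x, 1, 0.0, t, 1);  /* t = A*x */
    memcpy(z, y, m * sizeof *z);                   /* z = y   */
    cblas_daxpy((int)m, 1.0, t, 1, z, 1);          /* z += t  */
}

/* Fused kernel: each element of A*x stays in a register and is
 * combined with y immediately, so no intermediate vector exists. */
void matvec_add_fused(size_t m, size_t n, const double *A,
                      const double *x, const double *y, double *z)
{
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        z[i] = acc + y[i];
    }
}
```

For bandwidth-bound sizes the fused version reads A, x, and y once and writes z once; this cross-call saving is exactly what per-routine tuning cannot see, which is why a generator working over composed specifications can beat individually tuned BLAS calls.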

Cited by 44 publications (65 citation statements) · References 51 publications
Citation statements, 2011–2021, ordered by relevance:
“…Another different work seeks to update BLAS by extending it with additional functionalities [26]. Build-to-order BLAS [27] and Design-by-transformation BLAS [28] approach the problem from a different angle. Their goal is to generate optimized and tuned BLAS-like functions from high level kernel specifications.…”
Section: Related Work (mentioning) · Confidence: 99%
“…Belter et al presented a technique to optimize linear algebra kernels specified in MATLAB syntax using loop fusion [5]. Our approach of forward substitution is reminiscent of their technique of inlining loops and fusing them aggressively.…”
Section: Related Work (mentioning) · Confidence: 99%
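
The inlining-and-fusion technique this statement refers to can be pictured with a short C sketch (a hypothetical example with a made-up kernel, not code from either cited paper): the temporary vector defined by one loop is forward-substituted into the loop that consumes it, so it lives in a register instead of memory.

```c
/* Hypothetical kernel "t = x + y; s = dot(t, t)" before and after
 * forward substitution plus loop fusion. */
#include <stddef.h>

/* Unfused: t is materialized in memory, then re-read. */
double sumsq_unfused(size_t n, const double *x, const double *y,
                     double *t)
{
    for (size_t i = 0; i < n; i++)
        t[i] = x[i] + y[i];
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += t[i] * t[i];
    return s;
}

/* Fused: the definition of t is inlined at its only use, the two
 * loops merge, and the temporary never touches memory. */
double sumsq_fused(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ti = x[i] + y[i];  /* per-element temporary in a register */
        s += ti * ti;
    }
    return s;
}
```
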
“…Many compiler optimizations, e.g., those in Fig 2, have been shown to be highly effective for computations similar to the gemm kernel in Fig 1. However, the performance of the compiler-optimized code is often suboptimal when compared to those attained by manually optimized libraries, e.g., MKL [17], ACML [4], and ATLAS [29], which have been supplied by CPU vendors or HPC researchers, often with selected kernels directly implemented in assembly [8]. Developing highly optimized libraries manually, however, is excessively labor intensive and error prone.…”
Section: Introduction (mentioning) · Confidence: 99%