Braedy Kuzma scite author profile

Braedy Kuzma

5 Publications

7 Citation Statements Received

79 Citation Statements Given


Affiliations

University of Alberta

Publications


KernelFaRer

Carvalho, Kuzma, Korostelev, et al. 2021

ACM Trans. Archit. Code Optim.


Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code of an application, the highest performing solution is to replace the pattern with a call to the library. Idiom-recognition solutions in the past either required pattern matching machinery that was outside of the compilation framework or provided a very brittle solution that would fail even for minor variants in the pattern source code. This article introduces Kernel Find & Replacer (KernelFaRer), an idiom recognizer implemented entirely in the existing LLVM compiler framework. The versatility of KernelFaRer is demonstrated by matching and replacing two linear algebra idioms, general matrix-matrix multiplication (GEMM), and symmetric rank-2k update (SYR2K). Both GEMM and SYR2K are used extensively in scientific computation, and GEMM is also a central building block for deep learning and computer graphics algorithms. The idiom recognition in KernelFaRer is much more robust than alternative solutions, has a much lower compilation overhead, and is fully integrated in the broadly used LLVM compilation tools. KernelFaRer replaces existing GEMM and SYR2K idioms with computations performed by BLAS, Eigen, MKL (Intel’s x86), ESSL (IBM’s PowerPC), and BLIS (AMD). Gains in performance that reach 2000× over hand-crafted source code compiled at the highest optimization level demonstrate that replacing application code with a library call is a performant solution.
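For readers unfamiliar with the two idioms named in the abstract, the sketch below shows what hand-written GEMM and SYR2K loop nests typically compute. It is only an illustration in Python: the function names `gemm_naive` and `syr2k_naive` are hypothetical, the list-of-lists matrix layout is an assumption, and nothing here reflects how KernelFaRer itself matches or rewrites code inside LLVM.

```python
def gemm_naive(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for list-of-lists matrices.

    a is m x k, b is k x n, c is m x n. This triple loop is the kind of
    hand-written pattern that a tuned library call could replace.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


def syr2k_naive(alpha, a, b, beta, c):
    """Return alpha * (a @ b^T + b @ a^T) + beta * c.

    a and b are n x k, c is n x n. The full result is computed here for
    clarity; library routines usually update only one triangle of c.
    """
    n, k = len(a), len(a[0])
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[j][p] + b[i][p] * a[j][p]
            out[i][j] = alpha * acc + beta * c[i][j]
    return out
```

Both functions take scalars `alpha` and `beta` because the BLAS definitions of these operations include them; a loop nest without scaling corresponds to `alpha = 1` and `beta = 0`.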


Fast matrix multiplication via compiler‐only layered data reorganization and intrinsic lowering

et al. 2023


The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. The algorithm for the tiling and packing layers is target independent but is parameterized to the memory hierarchy and register-file size. The creation of high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand optimization task is complicated by the recent introduction of matrix engines by IBM®'s POWER10™ (Matrix Multiply Assist, MMA), Intel® (Advanced Matrix eXtensions, AMX), and Arm® (Matrix Extensions, ME) to deliver high-performance matrix operations. This paper presents a compiler-only alternative to the use of high-performance libraries by incorporating, to the best of our knowledge and for the first time, the automatic generation of the layered approach into LLVM, a production compiler. Modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The parameterization of the tiling and packing layers is demonstrated in the generation of code for the MMA unit on IBM's POWER10. This paper also describes an algorithm that lowers the matrix-multiply intrinsic to the MMA unit. The use of intrinsics enables a comprehensive performance study. In processors without hardware matrix engines, the tiling and packing delivers performance up to 22× (Intel), for small matrices, and more than 6× (POWER9), for large matrices, faster than PLuTo, a widely used polyhedral optimizer. The performance also approaches high-performance libraries and is only 34% slower than OpenBLAS and on-par with Eigen for large matrices. With MMA in POWER10 this solution is, for large matrices, over 2.6× faster than the vector-extension solution, matches Eigen performance, and achieves up to 96% of BLAS peak performance.
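The layered approach described in the abstract (tiling, packing, then a micro kernel) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the paper's LLVM implementation: the names `pack_tile`, `micro_kernel`, and `tiled_matmul` are hypothetical, the tile sizes `tm`, `tk`, `tn` stand in for parameters that would be chosen from the memory hierarchy and register-file size, and tiles are repacked on every use for simplicity, whereas a tuned implementation would reuse packed buffers.

```python
def pack_tile(mat, row0, col0, rows, cols):
    """Copy a rows x cols tile of mat into a flat row-major buffer.

    Elements that fall outside mat are left as zero padding, so the
    micro kernel can always work on full-size tiles.
    """
    n_rows, n_cols = len(mat), len(mat[0])
    buf = [0.0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            r, c = row0 + i, col0 + j
            if r < n_rows and c < n_cols:
                buf[i * cols + j] = mat[r][c]
    return buf


def micro_kernel(a_tile, b_tile, acc, tm, tk, tn):
    """acc (tm x tn) += a_tile (tm x tk) * b_tile (tk x tn), all flat."""
    for i in range(tm):
        for p in range(tk):
            a_ip = a_tile[i * tk + p]
            for j in range(tn):
                acc[i * tn + j] += a_ip * b_tile[p * tn + j]


def tiled_matmul(a, b, tm=2, tk=2, tn=2):
    """Multiply a (m x k) by b (k x n) using tiling, packing, and a micro kernel."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tm):              # tiling layer over rows of c
        for j0 in range(0, n, tn):          # tiling layer over columns of c
            acc = [0.0] * (tm * tn)
            for p0 in range(0, k, tk):      # walk the shared dimension
                a_tile = pack_tile(a, i0, p0, tm, tk)   # packing layer
                b_tile = pack_tile(b, p0, j0, tk, tn)
                micro_kernel(a_tile, b_tile, acc, tm, tk, tn)
            # Write back only the part of the tile that lies inside c.
            for i in range(min(tm, m - i0)):
                for j in range(min(tn, n - j0)):
                    c[i0 + i][j0 + j] = acc[i * tn + j]
    return c
```

The micro kernel is the only part that would change when targeting a different vector or matrix unit, which is the separation of concerns the abstract attributes to the matrix-multiply intrinsic interface.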


Compiler-Only Code Generation for Performant and Modular Matrix-Multiplication Micro Kernels Using Matrix Engines

Kuzma 2021


Learning to select mates in artificial life

Ashley, Chockalingam, Kuzma, et al. 2019


Acceleration Opportunities in Linear Algebra Applications via Idiom Recognition

Carvalho, Kuzma, Araújo 2020


