Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations

Yi, Qing; Wang, Qian; Cui, Huimin

doi:10.1109/micro.2014.14

Cited by 8 publications

(3 citation statements)

References 27 publications


“…Several approaches can fully automatically optimize MMA[×, +] and obtain high-performance code that outperforms expert-tuned implementations. The POET optimization library (Yi et al 2014) and AUGEM framework (Wang et al 2013) use annotations and templates of sequential code, respectively, written by domain experts to guide general-purpose compilers to produce optimized MMA[×, +] kernels from specifically prepared code. The Portable Compiler Approach (POCA) (Su et al 2017) generates an optimized micro-kernel based on LLVM IR representing MMA [×, +] and subsequent domain-specific but architecture-independent optimizations of its micro-kernel.…”

Section: Automatic Optimization of MMA[×, +] (mentioning)

confidence: 99%

High-Performance Generalized Tensor Operations

Gareev, Grosser, Kruse. 2018. ACM Trans. Archit. Code Optim.


The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available. CCS Concepts: • Software and its engineering → Compilers; • Computing methodologies → Linear algebra algorithms;



“…We implemented our method in the OpenBLAS [3] library and evaluated it on Phytium 2000+, an emerging high-performance many-core processor based on Arm's AArch64 architecture. We restrict our evaluation to DGEMM, as in prior work [10][11][12], for two reasons. First, the basic idea of the hybrid-grained load-balancing method applies to other variants of GEMM such as SGEMM, CGEMM and ZGEMM.…”

Section: Results (mentioning)

confidence: 99%

“…ATLAS [1] adopts the auto-tuning method to automatically generate kernels with different parameters in C and find the best-performing one by running them on the actual computing system. POET [12,16,17] and AUGEM [11] use a directive-based programming approach. POCA [14] is a compiler-based approach which generates and optimize kernels automatically and portably.…”

Section: Related Work (mentioning)

confidence: 99%

Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

Lei. 2018. Electronics.


The Basic Linear Algebra Subprograms (BLAS) is a fundamental numerical software library and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in the BLAS library. On multi-core and many-core processors, the whole workload of GEMM is partitioned and scheduled to multiple threads to exploit the parallel hardware. Generally, the workload is equally partitioned among threads and all threads are expected to accomplish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method to reduce the harm of the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm’s AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and achieves an improvement of GEMM performance by 1.9%.


Interactive Composition of Compiler Optimizations

Nesterenko, Wang. 2016. Languages and Compilers for Parallel Computing.

