Automatic Library Generation for BLAS3 on GPUs

Cui, Huimin; Wang, Lei; Xue, Jingling; Yang, Yang; Feng, Xiaobing

doi:10.1109/ipdps.2011.33

Cited by 25 publications

(19 citation statements)

References 33 publications


“…Cui et al [11] presented a similar system built using the Open64 compiler [36] and the WRaP-IT/URUK/URGenT polyhedral toolchain [10]. Here the authors started with optimized MAGMA/CUBLAS Fermi SGEMM kernels (A × B; Aᵀ × B; A × Bᵀ; Aᵀ × Bᵀ) and used automatic code transformations to extrapolate the SGEMM performance to the other three Level 3 BLAS kernels (STRMM, STRSM, SSYMM) with all combinations of inputs covered (left/right, lower/upper).…”

Section: Related Work (mentioning)

confidence: 99%

Autotuning GEMM Kernels for the Fermi GPU

Kurzak

Tomov

Dongarra

2012

IEEE Trans. Parallel Distrib. Syst.

106


Abstract: In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.
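The autotuning loop described in this abstract (generate variants, prune the search space, benchmark, keep the fastest) can be illustrated with a minimal CPU-only sketch. It is not the authors' generator: the tile-size parameters, the `max_tile_elems` budget used for pruning, and all function names are assumptions made for illustration.

```python
import itertools
import time


def make_matmul(tile_m, tile_n, tile_k):
    """Return a blocked matrix-multiply variant for the given tile sizes."""
    def matmul(a, b, n):
        c = [[0.0] * n for _ in range(n)]
        for i0 in range(0, n, tile_m):
            for j0 in range(0, n, tile_n):
                for k0 in range(0, n, tile_k):
                    for i in range(i0, min(i0 + tile_m, n)):
                        row_a, row_c = a[i], c[i]
                        for k in range(k0, min(k0 + tile_k, n)):
                            a_ik, row_b = row_a[k], b[k]
                            for j in range(j0, min(j0 + tile_n, n)):
                                row_c[j] += a_ik * row_b[j]
        return c
    return matmul


def pruned_search_space(sizes, max_tile_elems):
    """Enumerate tile triples, dropping those over a hypothetical resource budget."""
    return [
        (tm, tn, tk)
        for tm, tn, tk in itertools.product(sizes, repeat=3)
        # Assumed constraint: the A tile and B tile must fit in the budget together.
        if tm * tk + tk * tn <= max_tile_elems
    ]


def autotune(n=24, sizes=(2, 4, 8, 16), max_tile_elems=128, repeats=2):
    """Benchmark every surviving variant and return the fastest one with all timings."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for params in pruned_search_space(sizes, max_tile_elems):
        kernel = make_matmul(*params)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(a, b, n)
            best = min(best, time.perf_counter() - start)
        timings[params] = best
    winner = min(timings, key=timings.get)
    return winner, timings
```

Calling `autotune()` returns the winning tile triple and the per-variant timings. Which variant wins depends on the machine, so no expected output is shown.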


“…A number of general-purpose source-to-source compilers [6,9,8,10,17] can also generate highly efficient low-level C code for various DLA routines. These systems typically use empirical tuning to automatically experiment with different optimization choices and select those that perform the best.…”

Section: Related Work (mentioning)

confidence: 99%

Augem

Wang

Zhang

et al. 2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

137


Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual interference from developers. In particular, based on domain-specific knowledge about algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of Intel MKL and AMD ACML BLAS libraries, on both Intel Sandy Bridge and AMD Piledriver processors.


“…While having similarities with our approach, it mostly considers pre-optimized codes and relies on optimization hints provided by the programmer. Cui et al [4] also solicit the programmers to provide hints on code portions that present similarities in their performance characteristics. Compared to these approaches, we try to automatize the runtime selection.…”

Section: Related Work (mentioning)

confidence: 99%

Adaptive Runtime Selection for GPU

Dollinger

Loechner

2013

2013 42nd International Conference on Parallel Processing


It is often hard to predict the performance of a statically generated code. Hardware availability, hardware specification and problem size may change from one execution context to another. The main contribution of this work is an entirely automatic method aiming to predict execution times of semantically equivalent versions of affine loop nests on GPUs; then, to run the best performing one on GPU or CPU. To make accurate predictions, our framework relies on three consecutive stages: a static code generation, an offline profiling and an online prediction. Different versions are statically generated by PPCG, a source-to-source polyhedral compiler, able to generate CUDA code from static control loops written in C. The code versions differ by their block sizes, tiling and parallel schedule. The profiling code carries out the required measurements on the target machine: throughput between host and device memory, and execution time of the kernels with various parameters. At runtime, we rely on those results to calculate a predicted execution time on GPU. This is followed by a "fastest wins" algorithm, that runs instances of the target code concurrently on CPU and GPU; the first completed kills the other one. We validate this proposal on the polyhedral benchmark suite, showing that the predictions are accurate and that the runtime selection is effective on two different architectures.
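The online prediction step in this abstract can be pictured as combining profiled numbers into a per-version time estimate and picking the minimum. The sketch below is a simplified illustration, not the paper's model: the linear cost form, the `GpuVersion` fields, and the function names are all assumptions, and the concurrent "fastest wins" CPU/GPU race is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuVersion:
    """One statically generated code version with hypothetical offline profiling data."""
    name: str
    launch_overhead_s: float      # fixed cost per kernel launch (assumed)
    seconds_per_iteration: float  # profiled kernel cost per loop iteration (assumed)


def predicted_time(version, iterations, bytes_moved, bandwidth_bytes_per_s):
    """Predicted GPU time: host/device transfer plus kernel execution."""
    transfer = bytes_moved / bandwidth_bytes_per_s
    kernel = version.launch_overhead_s + version.seconds_per_iteration * iterations
    return transfer + kernel


def select_version(versions, iterations, bytes_moved, bandwidth_bytes_per_s):
    """Pick the version with the smallest predicted execution time."""
    return min(
        versions,
        key=lambda v: predicted_time(v, iterations, bytes_moved, bandwidth_bytes_per_s),
    )
```

With one version that has a high launch overhead but a cheap per-iteration cost and another with the opposite profile, the selection flips as the iteration count grows, which is the kind of context dependence the abstract describes.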

