2005
DOI: 10.1109/jproc.2004.840848
Self-Adapting Linear Algebra Algorithms and Software

Abstract: One of the main obstacles to the efficient solution of scientific problems is the problem of tuning software, both to the available architecture and to the user problem at hand. We describe approaches for obtaining tuned high-performance kernels, and for automatically choosing suitable algorithms. Specifically, we describe the generation of dense and sparse BLAS kernels, and the selection of linear solver algorithms. However, the ideas presented here extend beyond these areas, which can be considered proof of …

Cited by 162 publications (104 citation statements) | References 55 publications
“…One approach (that we explore) to systematically resolve these issues is the use of autotuning, a technique that in the context of OpenCL would involve collecting and generating multiple kernel versions, implementing the same algorithm optimized for different architectures, and heuristically selecting the best performing one. Autotuning has been used intensively on CPUs in the past to address these challenges to automatically generate near optimal numerical libraries, e.g., ATLAS [18,19] and PHiPAC [20] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [21,22] has shown that the technique is a very practical solution to easily port existing algorithmic solutions on quickly evolving GPU architectures and to substantially speed up even highly tuned hand-written kernels.…”
Section: Related Workmentioning
confidence: 99%
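The citing passage above describes empirical autotuning: generate several variants of the same kernel, time each on the target machine, and keep the fastest. A minimal sketch of that selection loop (illustrative only — not ATLAS, PHiPAC, or any library's actual code; the variant and function names are made up):

```python
import time

def dot_plain(a, b):
    # Straightforward dot product: one multiply-add per element.
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

def dot_unrolled(a, b):
    # 2-way unrolled variant: same result, possibly different speed
    # depending on the machine and interpreter/compiler.
    s0 = s1 = 0.0
    n = len(a) - len(a) % 2
    for i in range(0, n, 2):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
    for i in range(n, len(a)):  # leftover element if length is odd
        s0 += a[i] * b[i]
    return s0 + s1

def autotune(variants, a, b, reps=50):
    # Empirically time each candidate on representative inputs
    # and return the fastest one.
    best, best_t = None, float("inf")
    for f in variants:
        t0 = time.perf_counter()
        for _ in range(reps):
            f(a, b)
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = f, t
    return best

a = [1.0] * 1024
b = [2.0] * 1024
kernel = autotune([dot_plain, dot_unrolled], a, b)
```

Real autotuners additionally search over tile sizes, unroll factors, and instruction schedules, and cache the winner per architecture; the loop above only shows the select-by-measurement core.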
“…For example, ATLAS [18,19] and PHiPAC [20] are used to generate highly optimized BLAS. The main approach to autotuning is based on empirical optimization techniques.…”
Section: Performance Portability With Auto-tuningmentioning
confidence: 99%
“…WHT_{2^n} = (DFT_2 ⊗ I_{2^{n−1}})(I_2 ⊗ WHT_{2^{n−1}}) (recursive) (18) (c) Determine the exact operation counts (again, additions and multiplications separately) of both algorithms. Also determine the degree of reuse as defined in Section 2.1.…”
Section: Exercisesmentioning
confidence: 99%
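The factorization quoted above (noting DFT_2 is the 2-point butterfly [[1, 1], [1, −1]]) translates directly into a recursive transform: apply WHT_{2^{n−1}} to each half (the I_2 ⊗ WHT_{2^{n−1}} factor), then combine the halves with butterflies (the DFT_2 ⊗ I_{2^{n−1}} factor). A sketch, assuming real input of power-of-two length:

```python
def wht(x):
    # Walsh-Hadamard transform via the recursive factorization
    # WHT_{2^n} = (DFT_2 (x) I_{2^{n-1}}) (I_2 (x) WHT_{2^{n-1}}).
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    a = wht(x[:half])   # I_2 (x) WHT_{2^{n-1}}, first half
    b = wht(x[half:])   # I_2 (x) WHT_{2^{n-1}}, second half
    # DFT_2 (x) I_{2^{n-1}}: butterfly pairing element i of each half.
    return [a[i] + b[i] for i in range(half)] + \
           [a[i] - b[i] for i in range(half)]
```

Counting operations for the exercise: each of the n recursion levels performs 2^n additions/subtractions and no multiplications, for n·2^n additions in total.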
“…This way, the library can dynamically adapt to the computer's memory hierarchy. Sparsity and OSKI from the BeBOP group [17][18][19][20] are other examples of such libraries, used for sparse linear algebra problems.…”
Section: Introductionmentioning
confidence: 99%
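The adaptation idea behind libraries like Sparsity and OSKI can be sketched as choosing, per matrix, the storage format that makes y = A·x fastest on the machine at hand. The sketch below is illustrative only (plain Python, two toy formats) and does not reflect the OSKI API:

```python
import time

def spmv_dense(A, x):
    # y = A*x with A stored as a dense list of rows.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def to_csr(A):
    # Compressed Sparse Row: values, column indices, row pointers.
    vals, cols, ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                vals.append(a)
                cols.append(j)
        ptr.append(len(vals))
    return vals, cols, ptr

def spmv_csr(csr, x):
    # y = A*x touching only the stored nonzeros.
    vals, cols, ptr = csr
    return [sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]

def select_format(A, x, reps=20):
    # Benchmark each candidate on the actual matrix and keep the winner,
    # mirroring the install/tune-time measurement such libraries perform.
    csr = to_csr(A)
    candidates = [("dense", lambda: spmv_dense(A, x)),
                  ("csr", lambda: spmv_csr(csr, x))]
    best, best_t = None, float("inf")
    for name, f in candidates:
        t0 = time.perf_counter()
        for _ in range(reps):
            f()
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = name, t
    return best
```

The real libraries search a much richer space (register blocking, cache blocking, reordering), but the decision structure — measure candidates on the user's matrix, then commit — is the same.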