The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, can be written efficiently given a highly optimized general matrix-multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently supported using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations available today), supports a large variety of performance-critical routines.

Linear algebra is rich in operations that are highly optimizable, in the sense that a highly tuned code may run orders of magnitude faster than a naively coded routine. These optimizations are platform specific, however, such that an optimization for one computer architecture may actually cause a slowdown on another. To address this problem, a standard API of performance-critical linear algebra kernels was created, called the Basic Linear Algebra Subprograms (BLAS) [1][2][3][4][5], which provides such kernels as matrix multiply, triangular solve, etc. Given this API, the traditional method of achieving high-performance linear algebra routines called on the high-performance community to produce hand-optimized implementations for each new architecture of interest. This is a painstaking process, typically requiring many man-months of effort by personnel highly trained in both linear algebra and computational optimization. The incredible pace of hardware evolution makes this approach untenable in the long run, particularly when one considers that the many software layers (e.g., operating systems, compilers, etc.) that also affect these optimizations are changing at similar, but independent, rates.

A new paradigm is therefore needed for producing highly efficient routines in the modern age of computing, and our own project, Automatically Tuned Linear Algebra Software (ATLAS) [6][7][8][9][10], represents an implementation of such a set of new techniques. We call this paradigm 'Automated Empirical Optimization of Software', or AEOS. An AEOS-enabled package such as ATLAS provides many ways of performing the required operations, and uses empirical timings to choose the best method for a given architecture. Thus, if written generally enough, an AEOS-aware package can automatically adapt to a new computer architecture in a matter of hours, rather than requiring the months or even years of highly trained professionals' time dictated by traditional methods.

Today, ATLAS-tuned libraries are among the most widely used BLAS implementations in existence. They are used in problem-solving environments such as MAPLE, MATLAB, and Octave, in compilers such as Absoft Pro Fortran, and in a wide variety of operating systems, including OS X, FreeBSD, and most versions of Linux. Finally, the ATLAS BLAS are used in a host of software.