Level 3 BLAS in LU Factorization on the CRAY-2, ETA-10P, and IBM 3090-200/VF

Daydé, Michel; Duff, Iain S.

doi:10.1177/109434208900300204

Cited by 20 publications

(12 citation statements)

References 13 publications


“…We considered the blocking of the triangular solver from the Level 3 BLAS (TRSM) in Daydé and Duff [1989]. Then, we developed a blocked version of the Level 3 BLAS for MIMD vector multiprocessors [Amestoy and Daydé 1993; Daydé et al. 1994].…”

Section: Motivations and Design of the RISC BLAS (mentioning)

confidence: 99%

The RISC BLAS

Daydé

Duff

1999

ACM Trans. Math. Softw.

Self Cite


We describe a version of the Level 3 BLAS which is designed to be efficient on RISC processors. This is an extension of previous studies by the authors and colleagues on a similar approach for efficient serial and parallel implementations on virtual-memory and shared-memory multiprocessors. All our codes are written in Fortran and use loop-unrolling, blocking, and copying to improve the performance. A blocking technique is used to express the BLAS in terms of operations involving triangular blocks and calls to the matrix-matrix multiplication kernel (GEMM). No manufacturer-supplied or assembler code is used. This blocked implementation uses the same blocking ideas as in our implementation for vector machines except that the ordering of loops is designed for efficient reuse of data held in cache and not necessarily for parallelization. All the codes are specifically tuned for RISC processors. The software also includes a tuned version of GEMM. A parameter which controls the blocking allows efficient exploitation of the memory hierarchy on the various target computers. We present results on a range of RISC-based workstations and multiprocessors: CRAY T3D, DEC 8400 5/300, HP 715/64, IBM SP2, MEIKO CS2-HA, SGI Power Challenge 10000, and SUN UltraSPARC-1 model 140. This implementation of the Level 3 BLAS is available on anonymous FTP, and we welcome input from users to improve and extend our BLAS implementation.


“…First, let us summarize our conclusions from a previous report (Daydé and Duff 1989) on the implementation of block LU factorization on one processor of the CRAY-2, the ETA10-P, and the IBM 3090 vector processors. KJI-SAXPY and JKI-GAXPY have similar performance using the Fortran model implementation or the tuned versions of Level 2 and Level 3 BLAS.…”

Section: Comparison of the Block Factorization Variants (mentioning)

confidence: 99%

“…The aim of this work is to show that, based on the use of Level 3 BLAS kernels, portable and efficient code can be designed for parallel vector computers with a global shared memory, extending discussions in Daydé and Duff (1989). This class of computer architecture is widely used in the design of today's supercomputers including the ALLIANT FX/80, the CRAY-2, and the IBM 3090.…”

Section: Introduction (mentioning)

confidence: 99%


Use of parallel level 3 BLAS in LU factorization on three vector multiprocessors: the ALLIANT FX/80, the CRAY-2, and the IBM 3090 VF

Daydé

Duff

1990

Proceedings of the 4th International Conference on Supercomputing


“…In the blocked jik-SDOT [7], one block column of L and one block row of U are computed in each iteration. The basic steps involved in the jth iteration are shown in Figure 4 along with the data dependencies involved in each step.…”

Section: Parallel Blocked jik-SDOT (mentioning)

confidence: 99%

On the parallelization of blocked LU factorization algorithms on distributed memory architectures

Laszewski

Parashar

Mohamed

et al.

Proceedings Supercomputing '92


Solutions to systems of linear equations and, specifically, the LU factorization of matrices form the computational core of many scientific and engineering applications. In this paper, we present the parallelization of blocked algorithms for LU factorization. We isolate problems inherent to sequential blocked algorithms and provide approaches to overcome them on distributed memory architectures. The performance of the parallelized versions of three blocked algorithms suited to column oriented Fortran is compared. Experiments are performed on the iPSC/860 Hypercube. Our study shows that it is not intuitively clear which algorithm might perform best on a given architecture, but is dependent on the problem size and the number of available processors.
