Use of Level 3 BLAS in LU Factorization in a Multiprocessing Environment on Three Vector Multiprocessors: the Alliant FX/80, the CRAY-2, and the IBM 3090 VF

Daydé, Michel; Duff, Iain S.

doi:10.1177/109434209100500308

Cited by 14 publications (9 citation statements). References: 10 publications.

“…This work is a logical continuation of previous studies on the implementation of Level 3 BLAS on a Transputer network (Berger, Daydé, and Morère (1991)) and on the design of a blocked parallel version of Level 3 BLAS for various shared memory vector multiprocessors (see Daydé and Duff (1991), and Daydé, Duff, and Petitet (1992)).…”

Section: Level 3 BLAS on the BBN TC2000 (mentioning)

“…In Table 7.1 we compare the speed-ups obtained on parallel matrix-matrix multiplication on the BBN TC2000 and other shared memory multiprocessors: the Alliant FX/80, the CRAY-2, and the IBM 3090 models E and J (see Daydé and Duff (1991)). The speed-ups obtained on the BBN TC2000 can be successfully compared to those obtained on the other computers even if the performance achieved is not always comparable.…”

Section: Discussion (mentioning)

“…To create more parallelism during a node process and to use data locality better, we have modified the algorithm used in version 3. We overlap the parallel updating involved at step k of the KJI-SAXPY scheme with the elimination of the block row at step k+1 in a similar way to that done by Daydé and Duff (1991). Based on the value of the blocking parameters introduced for version 3 (not modified in version 4), we know during the assembly phase if the node elimination step will create parallel tasks.…”

Section: 4 (mentioning)

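The overlap of the step-k update with the elimination of block row k+1 can be illustrated with a toy schedule. The sketch below is a hypothetical model, not the authors' code: it assumes unit-cost tasks, with ("F", k) standing for the elimination of block row k and ("U", k, i) for the step-k update of block row i, and it compares a schedule with a barrier after every step against greedy list scheduling on p processors, where block row k+1 may be eliminated as soon as its own step-k update is finished. The names barrier_makespan and lookahead_makespan are illustrative.

```python
import math


def barrier_makespan(n: int, p: int) -> int:
    """Time units when step k+1 waits for every update of step k (no overlap)."""
    total = 0
    for k in range(n):
        total += 1                           # eliminate block row k (one task)
        total += math.ceil((n - 1 - k) / p)  # update the n-1-k remaining block rows
    return total


def lookahead_makespan(n: int, p: int) -> int:
    """Greedy list scheduling: block row k+1 may be eliminated once U(k, k+1) is done."""
    # ("F", k): eliminate block row k; ("U", k, i): apply step-k transformations to row i.
    deps = {}
    for k in range(n):
        deps[("F", k)] = [("U", k - 1, k)] if k > 0 else []
        for i in range(k + 1, n):
            deps[("U", k, i)] = [("F", k)] + ([("U", k - 1, i)] if k > 0 else [])

    def priority(task):
        # Eliminations first, then updates of the earliest step, nearest block row first.
        return (0, task[1], 0) if task[0] == "F" else (1, task[1], task[2])

    done = set()
    t = 0
    while len(done) < len(deps):
        ready = [task for task in deps
                 if task not in done and all(d in done for d in deps[task])]
        ready.sort(key=priority)
        done.update(ready[:p])  # up to p unit-cost tasks run in this time slot
        t += 1
    return t


print(barrier_makespan(4, 2), lookahead_makespan(4, 2))  # 8 7
```

With 4 block rows and 2 processors the barrier schedule needs 8 time units and the overlapped one 7. With 1 processor both need 10, and with enough processors to run all updates of a step at once they coincide, so in this model the gain comes from filling processor slots that would otherwise sit idle at the end of each step.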
“…The parallelism from the elimination tree can thus be combined with a node-level parallelism that exploits ideas used in full linear algebra (see Section 4 and Amestoy, Daydé, and Duff (1989)). For the elimination process, we use a row-oriented (frontal matrices are stored by row) adaptation of the right-looking variant (KIJ-SAXPY) (see for example Daydé and Duff (1991)). At the k-th step of this block form, a block row of the factors is computed and the corresponding transformations, based on Level 3 BLAS kernels, are applied to the remaining reduced matrix.…”

Section: Introduction (mentioning)

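The right-looking block scheme described in the last snippet can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions: no pivoting (the matrix is assumed to have nonsingular leading principal submatrices, e.g. diagonally dominant), a dense row-major list of rows instead of a frontal matrix, and plain loops standing in for the Level 3 BLAS kernels (a unit lower triangular solve in the role of _TRSM for the block row, a matrix-matrix product in the role of _GEMM for the update of the reduced matrix). The name blocked_lu is hypothetical.

```python
def blocked_lu(A, nb):
    """Right-looking blocked LU without pivoting, in place on a row-major list of rows.

    On return the strict lower triangle holds L (unit diagonal implied) and the
    upper triangle holds U, so that L @ U reproduces the input.
    """
    n = len(A)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # 1) Factor the block column: diagonal block plus the panel below it.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # 2) Block row of U: U12 = inv(L11) @ A12 (unit lower triangular solve, TRSM-like).
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # 3) Update the reduced matrix: A22 -= L21 @ U12 (matrix-matrix product, GEMM-like).
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = A[i][k]
                for j in range(k1, n):
                    A[i][j] -= lik * A[k][j]
    return A
```

Choosing nb = n reduces the sketch to the unblocked algorithm, while nb much smaller than n leaves most of the arithmetic in step 3, the part a tuned matrix-matrix kernel would execute. Step 3 runs its innermost loop along rows, which matches the row storage mentioned in the snippet.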
Linear Algebra Calculations on a Virtual Shared Memory Computer. Amestoy, Duff, and Daydé. Int. J. High Speed Comp., 1995.

We evaluate the impact of the memory hierarchy of virtual shared memory computers on the design of algorithms for linear algebra. On classical shared memory multiprocessor computers, block algorithms are used for efficiency. We study here the potential and the limitations of such approaches on globally addressable distributed memory computers. The BBN TC2000 belongs to this class of computers and will be used to illustrate our discussion. The BBN TC2000 is a virtual shared memory multiprocessor with up to 512 nodes. Each node contains one RISC processor (a Motorola 88100) and 16 MBytes of memory. The originality of the BBN TC2000 comes from its interconnection network (Butterfly switch) and from its globally addressable memory. Memory references can be either remote or local to one node. The memory hierarchy consists of the disks, the remote memory, the local memory of each node, the local cache of the 88100, and the internal registers of the processor. We describe the implementation of Level 3 BLAS and examine the performance of some of the LAPACK routines. The impact of the number of processors with respect to the choice of the variants of classical matrix factorizations (for example KJI, JKI, JIK for the LU factorization) is discussed. We also study the factorization of sparse matrices based on a multifrontal approach. The ideas introduced for the parallelization of full linear algebra codes are applied to the sparse case. We discuss and illustrate the limitations of this approach in sparse multifrontal factorization. We show that the speed-ups obtained on the BBN TC2000 for the class of methods presented here are comparable to those obtained on more classical shared memory computers, such as the Alliant FX/80, the CRAY-2, and the IBM 3090/VF, and we explain why our approach can be extended to other virtual shared memory multiprocessors.
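The LU variants named in this abstract can be compared on the unblocked algorithm. The sketch assumes the usual convention in which the three letters give the loop nesting from outermost to innermost (k: elimination step, j: column, i: row) and omits pivoting; lu_kji and lu_jki are hypothetical names. Both variants produce the same factors and differ only in the order in which entries are touched: KJI updates the whole reduced matrix at every step (right-looking), whereas JKI brings each column up to date just before it is eliminated (left-looking).

```python
def lu_kji(A):
    """Right-looking (KJI) unblocked LU without pivoting, in place on a list of rows."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # column k of L
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # SAXPY on column j of the reduced matrix
    return A


def lu_jki(A):
    """Left-looking (JKI) unblocked LU without pivoting: column j is updated only when reached."""
    n = len(A)
    for j in range(n):
        for k in range(j):
            for i in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # apply step k to column j
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                 # column j of L
    return A
```

Because the right-looking form exposes a large independent update at every step while the left-looking form concentrates work on one column at a time, the two orderings offer different amounts of parallelism and data reuse, which is why their relative merit can depend on the number of processors.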

“…In Figures 10 and 9, the effective speed-ups are drawn with solid lines and the theoretical ones with dashed lines. The theoretical speed-up is computed according to Amdahl's law (see [12]):…”

Section: Residuals and Stopping Criteria (mentioning)

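Amdahl's law is commonly written as S(p) = 1 / ((1 - f) + f / p), where f is the fraction of the sequential run time that can be parallelized and p is the number of processors. The exact expression used in the citing paper is elided in the snippet above, so the sketch below implements only this standard form; the function names are illustrative. The second function inverts the law to estimate f from a measured speed-up, one common way to draw a theoretical curve next to measured data (an assumption, not necessarily the paper's procedure).

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl's law S(p) = 1 / ((1 - f) + f / p), f = parallelizable fraction."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - f) + f / p)


def parallel_fraction(speedup: float, p: int) -> float:
    """Invert Amdahl's law: the f for which p processors give the observed speed-up."""
    if p < 2:
        raise ValueError("p must be at least 2 to identify f")
    # From 1/S = 1 - f * (1 - 1/p).
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / p)


print(round(amdahl_speedup(0.9, 8), 2))  # 4.71
```

For example, f = 0.9 gives S(8) ≈ 4.71, and no number of processors can push the speed-up above 1 / (1 - f) = 10.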
Highly nonnormal eigenproblems in the aeronautical industry. Braconnier, Chatelin, and Dunyach. Japan J. Indust. Appl. Math., 1995.

We consider a large-scale nonnormal eigenvalue problem that occurs in flutter analysis. Matrices arising in such problems are usually sparse, of large order, and highly nonnormal. We use the incomplete Arnoldi method associated with the Tchebycheff acceleration in order to compute a subset of the eigenvalues and their associated eigenvectors. This method has been parallelized using BLAS kernels and has been tested on various vector and parallel machines. We also studied the stability of the eigenproblem by perturbing the original matrix and verified that the spectral instability increases with the size of the problem. This work has been conducted at CERFACS in cooperation with the Aerospatiale Avions (Structural research and development department).
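The Arnoldi process at the core of this method builds an orthonormal basis V of a Krylov subspace and an upper Hessenberg matrix H with A V_m = V_{m+1} H. The sketch below covers only the basic process with modified Gram-Schmidt orthogonalization on plain Python lists; the restarting of the incomplete variant and the Tchebycheff (Chebyshev) acceleration mentioned in the abstract are not shown. The function name arnoldi and its parameters (including the breakdown tolerance tol) are illustrative.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def arnoldi(matvec: Callable[[Vector], Vector], v0: Vector, m: int,
            tol: float = 1e-12) -> Tuple[List[Vector], List[Vector]]:
    """m Arnoldi steps with modified Gram-Schmidt.

    Returns V (m+1 orthonormal vectors) and H ((m+1) x m, upper Hessenberg) such that
    matvec(V[j]) equals sum_i H[i][j] * V[i] up to rounding. On breakdown, V spans an
    invariant subspace and H is returned as the square (j+1) x (j+1) block.
    """
    beta0 = math.sqrt(sum(x * x for x in v0))
    V = [[x / beta0 for x in v0]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(V[j])
        for i in range(j + 1):              # orthogonalize against v_0 .. v_j
            h = sum(a * b for a, b in zip(V[i], w))
            H[i][j] = h
            w = [a - h * b for a, b in zip(w, V[i])]
        beta = math.sqrt(sum(x * x for x in w))
        H[j + 1][j] = beta
        if beta <= tol:                     # breakdown: invariant subspace found
            return V, [row[: j + 1] for row in H[: j + 1]]
        V.append([x / beta for x in w])
    return V, H
```

Because A enters only through matvec, a sparse matrix-vector product can be plugged in unchanged. The eigenvalues of the leading m × m block of H (the Ritz values) approximate eigenvalues of A and would be computed with a dense eigensolver, which is omitted here.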
