1994
DOI: 10.1145/178365.174413

A parallel block implementation of Level-3 BLAS for MIMD vector processors

Abstract: We describe an implementation of Level-3 BLAS (Basic Linear Algebra Subprograms) based on the use of the matrix-matrix multiplication kernel (GEMM). Blocking techniques are used to express the BLAS in terms of operations involving triangular blocks and calls to GEMM. A principal advantage of this approach is that most manufacturers provide at least an efficient serial version of GEMM so that our implementation can capture a significant percentage of the computer performance. A parameter which controls the bloc…
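To illustrate the idea in the abstract, here is a minimal sketch of a blocked triangular matrix multiply (the TRMM operation, C = L·B with L lower triangular) in which every rectangular off-diagonal block is handled by a GEMM-style kernel and only the small diagonal blocks need a triangular kernel. This is not the authors' code: the function names, the out-of-place result, the pure-Python loops, and the block-size argument `nb` are all assumptions made for illustration. In practice `gemm` would be the vendor-tuned routine.

```python
def gemm(A, B, C, i0, i1, k0, k1, ncols):
    """C[i0:i1, :] += A[i0:i1, k0:k1] @ B[k0:k1, :] (stand-in for a tuned GEMM)."""
    for i in range(i0, i1):
        for k in range(k0, k1):
            a = A[i][k]
            for j in range(ncols):
                C[i][j] += a * B[k][j]


def trmm_diag(L, B, C, r0, r1, ncols):
    """C[r0:r1, :] += tril(L[r0:r1, r0:r1]) @ B[r0:r1, :].

    Reads only the lower triangle of the diagonal block.
    """
    for i in range(r0, r1):
        for k in range(r0, i + 1):
            a = L[i][k]
            for j in range(ncols):
                C[i][j] += a * B[k][j]


def blocked_trmm(L, B, nb):
    """Return C = tril(L) @ B using row blocks of height nb (hypothetical API)."""
    n, ncols = len(L), len(B[0])
    C = [[0.0] * ncols for _ in range(n)]
    for r0 in range(0, n, nb):
        r1 = min(r0 + nb, n)
        if r0 > 0:
            # Everything left of the diagonal block is a full rectangle: use GEMM.
            gemm(L, B, C, r0, r1, 0, r0, ncols)
        # The diagonal block is triangular: use the small triangular kernel.
        trmm_diag(L, B, C, r0, r1, ncols)
    return C


def reference_trmm(L, B):
    """Unblocked reference: C[i][j] = sum over k <= i of L[i][k] * B[k][j]."""
    n, ncols = len(L), len(B[0])
    return [[sum(L[i][k] * B[k][j] for k in range(i + 1)) + 0.0
             for j in range(ncols)] for i in range(n)]
```

With a larger `nb`, more of the work falls in the triangular kernel; with a smaller `nb`, more of it goes through `gemm`. That trade-off is presumably what a block-size parameter of the kind the abstract mentions would tune, though the truncated abstract does not say so explicitly.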

Cited by 18 publications (17 citation statements)
References: 10 publications
“…Finally, for some machines, performance could be enhanced by judiciously selecting appropriate leading dimensions of the matrices (e.g., avoiding powers of 2), although we do not consider this because it is dependent on the machine architecture and cache management strategy. We demonstrated in Daydé et al [1994] how this blocked version could be used to parallelize the Level 3 BLAS. A preliminary version was successfully used for developing both serial and parallel tuned versions of the Level 3 BLAS for a 30-node BBN-TC2000 [Amestoy et al 1995;Daydé and Duff 1995].…”
Section: Results (mentioning)
confidence: 99%
“…Then, we developed a blocked version of the Level 3 BLAS for MIMD vector multiprocessors [Amestoy and Daydé 1993;Daydé et al 1994]. At the same time, we also studied the development of a parallel version of the Level 3 BLAS for Transputers [Berger et al 1991].…”
Section: Motivations and Design of the RISC BLAS (mentioning)
confidence: 99%
“…Therefore, a variety of techniques for memory accesses, such as increasing the cache hit ratio have been proposed [7,6,3]. Blocking is one of these techniques that performs the processing by each partial area (block) [9,10,8,4].…”
Section: Introduction (mentioning)
confidence: 99%