1999
DOI: 10.1145/326147.326150

The RISC BLAS

Abstract: We describe a version of the Level 3 BLAS which is designed to be efficient on RISC processors. This is an extension of previous studies by the authors and colleagues on a similar approach for efficient serial and parallel implementations on virtual-memory and shared-memory multiprocessors. All our codes are written in Fortran and use loop-unrolling, blocking, and copying to improve the performance. A blocking technique is used to express the BLAS in terms of operations involving triangular blocks and calls to …
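
The abstract names blocking and copying among its techniques. The following is a minimal Python sketch of those two ideas applied to a plain matrix product. It is illustrative only: the paper's codes are in Fortran, and the function name `blocked_matmul`, the `block` parameter, and the loop order are assumptions made here, not details taken from the paper. Loop unrolling is left out to keep the sketch short.

```python
def blocked_matmul(a, b, n, block=2):
    """Return C = A * B for n-by-n matrices stored as lists of rows.

    Hypothetical helper: works block by block so that each block of A
    is reused across a whole block-row of the product.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        i_end = min(ii + block, n)
        for kk in range(0, n, block):
            k_end = min(kk + block, n)
            # Copying: pack the current block of A into a small buffer
            # so the inner loops read it from one compact region.
            a_blk = [a[i][kk:k_end] for i in range(ii, i_end)]
            for jj in range(0, n, block):
                j_end = min(jj + block, n)
                # Blocking: multiply one block of A by one block of B.
                for i in range(ii, i_end):
                    a_row = a_blk[i - ii]
                    c_row = c[i]
                    for k in range(kk, k_end):
                        a_ik = a_row[k - kk]
                        b_row = b[k]
                        for j in range(jj, j_end):
                            c_row[j] += a_ik * b_row[j]
    return c
```

In a compiled language the block size would be chosen so that the packed block and the blocks it is combined with fit in the targeted cache level; in this sketch it is only a parameter.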

Cited by 4 publications (7 citation statements)
References 14 publications
“…Notice that because the algorithm accesses tiles of the adjacency matrix, a cache-aware layout can store such tiles continuously in memory improving the cache behavior of the algorithm. Such a layout reduces self/inter interference, therefore, the cache conflicts further (see also [23]- [25]). …”
Section: A Recursive Dandc Algorithm R-kleene; citation type: mentioning
confidence: 99%
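
The tile-by-tile storage described in the statement above can be sketched as an index mapping. This is a minimal Python illustration, not the citing paper's code: the names `tiled_index` and `to_tiled`, the tile size `t`, and the requirement that `t` divides the matrix dimension are all assumptions made here.

```python
def tiled_index(i, j, n, t):
    """Offset of element (i, j) of an n-by-n matrix in a tile-major layout.

    Hypothetical helper: tiles are t-by-t, ordered row by row, and each
    tile is stored row-major in t*t consecutive positions.
    Assumes t divides n.
    """
    tiles_per_row = n // t
    tile_id = (i // t) * tiles_per_row + (j // t)
    return tile_id * t * t + (i % t) * t + (j % t)


def to_tiled(matrix, t):
    """Repack a square list-of-rows matrix into one flat tile-major list."""
    n = len(matrix)
    if n % t != 0:
        raise ValueError("tile size must divide the matrix dimension")
    flat = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[tiled_index(i, j, n, t)] = matrix[i][j]
    return flat
```

With this layout, an algorithm that works on one tile at a time reads a single run of `t * t` consecutive entries instead of `t` separate row fragments.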
“…The RISC-BLAS library [8] is written in Fortran, was optimized by hand using unroll-and-jam, loop tiling and data copying [23] and is specifically tuned for RISC processors. On the MIPS and ALPHA 21264 platforms, the library tiles for the L1 cache level, while for the ALPHA 21164, the library ignores the small (8Kb) first level cache and tiles for the L2 on-chip cache level.…”
Section: The Risc-blas Version; citation type: mentioning
confidence: 99%
“…We evaluated six different versions of each benchmark program: one is the original code as proposed in [10] with no restructuring transformation (ORI-blas); the second one calls the manufacturer-supplied BLAS3 library to perform the operation (VENDOR-blas); the third one calls the RISC-BLAS library [8] (RISC-blas); the fourth one is the code after tiling for both cache and register levels using our own developed tool (TCRL); and the last two versions are the codes after tiling only for the cache level (TCL) and only for the register level (TRL). We use these later versions to show the effects of tiling for each individual level.…”
Section: Program Versions; citation type: mentioning
confidence: 99%