The RISC BLAS

Daydé, Michel; Duff, Iain S.

doi:10.1145/326147.326150

Cited by 4 publications

(7 citation statements)

References 14 publications


“…Notice that because the algorithm accesses tiles of the adjacency matrix, a cache-aware layout can store such tiles continuously in memory improving the cache behavior of the algorithm. Such a layout reduces self/inter interference, therefore, the cache conflicts further (see also [23]-[25]). …”

Section: A Recursive Dandc Algorithm R-kleene

mentioning

confidence: 99%
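The tile-contiguous layout mentioned in the statement above can be illustrated with a minimal sketch. It is not taken from the R-Kleene paper: the function names `tiled_index` and `to_tiled`, the row-major ordering of tiles, and the assumption that the tile size divides the matrix size are all illustrative choices.

```python
def tiled_index(i, j, n, t):
    """Offset of element (i, j) of an n x n matrix stored tile by tile.

    Assumption: t divides n. Tiles are ordered row-major, and the
    elements inside each t x t tile are also stored row-major.
    """
    tiles_per_row = n // t
    tile_row, in_row = divmod(i, t)
    tile_col, in_col = divmod(j, t)
    tile_id = tile_row * tiles_per_row + tile_col
    return tile_id * t * t + in_row * t + in_col


def to_tiled(matrix, t):
    """Copy a list-of-lists matrix into one flat buffer in tiled order."""
    n = len(matrix)
    flat = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[tiled_index(i, j, n, t)] = matrix[i][j]
    return flat
```

With this mapping, every t x t tile occupies one run of t*t consecutive positions in the buffer, so an algorithm that works tile by tile touches one contiguous block at a time.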

R-Kleene: A High-Performance Divide-and-Conquer Algorithm for the All-Pair Shortest Path for Densely Connected Networks

D'Alberto

Nicolau

2007

Algorithmica


We propose a novel divide-and-conquer algorithm for the solution of the all-pair shortest-path problem for directed and dense graphs with no negative cycles. We propose R-Kleene, a compact and in-place recursive algorithm inspired by Kleene's algorithm. R-Kleene delivers a better performance than previous algorithms for randomly generated graphs represented by highly dense adjacency matrices, in which the matrix components can have any integer value. We show that R-Kleene, unchanged and without any machine tuning, yields consistently between 1/7 and 1/2 of the peak performance running on five very different uniprocessor systems.

Introduction.

The all-pair shortest-paths problem (APSP) is a well-studied and basic problem in graph theory but it is also a crucial and real problem in large networks such as sensor networks, switch networks or complex targeting systems.

Consider the scenario where many thousands of nodes are located across a large area and every node has a processor with little memory space and computational power. In this scenario the computation of APSP is neither feasible nor practical by a single node, nonetheless it is a key feature for efficient data routing and broadcasting. Despite the node-processor computational/memory limitations, a node in the network is able to determine the locations and distances of its neighbors rather easily. Such local information can be coded, sent on the network and collected by an observer node such as a satellite, a global router or a computer cluster. Then the observer node may construct the adjacency matrix, compute the solution and send the result back on the network where each node will store the necessary local information.

Any network is naturally represented by a directed graph and we formalize APSP as follows. Given a graph G = (V, E) where V is a set of nodes and E is a set of directed edges, we label every node in the graph by an integer ι ∈ [0, n − 1] where n = |V| (n = |V| is the cardinality of the set V), and an edge in E is defined by a unique ordered pair of integers (i, j) with i, j ∈ [0, n − 1]. In fact, we assume that there is at most one directed edge connecting two nodes and, therefore, the graph has


“…The RISC-BLAS library [8] is written in Fortran, was optimized by hand using unroll-and-jam, loop tiling and data copying [23] and is specifically tuned for RISC processors. On the MIPS and ALPHA 21264 platforms, the library tiles for the L1 cache level, while for the ALPHA 21164, the library ignores the small (8Kb) first level cache and tiles for the L2 on-chip cache level.…”

Section: The Risc-blas Version

mentioning

confidence: 99%
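Two of the techniques named in the statement above, loop tiling and data copying, can be sketched on a plain matrix multiply. This is an illustrative Python sketch, not RISC BLAS code (the library itself is written in Fortran): the function name `matmul_tiled`, the loop order, and the requirement that the tile size `t` divides `n` are assumptions, and unroll-and-jam is left out.

```python
def matmul_tiled(a, b, n, t):
    """Return C = A * B for n x n list-of-lists matrices.

    Assumption: t divides n. The k and j loops are tiled with tile
    size t, and each tile of B is packed into a small buffer first.
    """
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, t):
        for jj in range(0, n, t):
            # Data copying: pack the t x t tile of B that starts at
            # (kk, jj) into its own buffer, so the inner loops reuse
            # a compact block instead of striding through all of B.
            b_tile = [b[k][jj:jj + t] for k in range(kk, kk + t)]
            for i in range(n):
                row_a = a[i]
                row_c = c[i]
                for k in range(t):
                    a_ik = row_a[kk + k]
                    b_row = b_tile[k]
                    for j in range(t):
                        row_c[jj + j] += a_ik * b_row[j]
    return c
```

Choosing `t` so that the packed tile fits in a given cache level is the tuning decision the statement describes (L1 on some platforms, L2 on others); the sketch leaves that choice to the caller.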

“…We evaluated six different versions of each benchmark program: one is the original code as proposed in [10] with no restructuring transformation (ORI-blas); the second one calls the manufacturer-supplied BLAS3 library to perform the operation (VENDOR-blas); the third one calls the RISC-BLAS library [8] (RISC-blas); the fourth one is the code after tiling for both cache and register levels using our own developed tool (TCRL); and the last two versions are the codes after tiling only for the cache level (TCL) and only for the register level (TRL). We use these later versions to show the effects of tiling for each individual level.…”

Section: Program Versions

mentioning

confidence: 99%

“…This can be explained because the RISC-blas codes are heavily optimized for the cache level and perform data copying [8]. Yet, even for large problem sizes, RISC-blas codes seldom ever outperform TCRL.…”

Section: Risc-blas Versus Tcrl

mentioning

confidence: 99%

“…This type of loop nests are commonly found in linear algebra algorithms, typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in [8]. Results will show how compiler technology can make it possible for non-rectangular loop nests to achieve as high performance as hand-optimized codes on modern microprocessors.…”

mentioning

confidence: 99%


On the performance of hand vs. automatically optimized numerical codes

Jimenez

Llaberia

Fernandez

Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)


In this paper, we compare automatic-optimized codes against hand-optimized codes. The automatic-optimized codes have been generated using our own developed tool that implements compiler techniques proposed in our previous work. Our compiler techniques focus on applying multilevel tiling to non-rectangular loop nests. This type of loop nests are commonly found in linear algebra algorithms, typically used in numerical codes. As hand-optimized codes, we use two different numerical libraries: the BLAS3 library provided by the manufacturers and the RISC-BLAS library proposed in [8]. Results will show how compiler technology can make it possible for non-rectangular loop nests to achieve as high performance as hand-optimized codes on modern microprocessors.

Motivation

Existing compiler technology is oriented mostly towards simple numerical codes containing loop nests that describe rectangular iteration spaces [4][20][22][24]. This is understandable since transformations are easy to apply on this type of loop nests. However, several linear algebra algorithms also contain complex loop nests defining non-rectangular iteration spaces and current commercial compilers are unable to restructure and optimize these types of codes.

This fact has led many programmers to restructure their algorithms by hand to perform well on particular architectures, a situation that has led to machine-specific programs. Additionally, manufacturers have tried to minimize the complexity of writing optimized codes by providing numerical libraries that attain high performance under their particular machine. The BLAS3 library [10], for example, provides a set of standard linear algebra operations. On top of the BLAS standard interface, higher level library packages such as LAPACK [2] have been built. However, not all applications can take advantage of these libraries and there are many situations in which none of the routines provided can specifically solve the task at hand.

We believe that restructuring a code should be the job of the compiler. Compilers should handle the machine-specific details required to attain high performance on each particular architecture.

To illustrate how current commercial compilers achieve poor performance on non-rectangular loop nests, Fig. 1 shows the performance (in Mflop/s) obtained by the linear algebra problems SGEMM and STRMM, varying the problem size. SGEMM consists of a very simple rectangular loop nest, performing a rectangular matrix multiply while STRMM consists of a non-rectangular loop nest, performing also a matrix multiply but with one of the matrices being triangular. The circle curves show the performance obtained if we directly compile the codes using the f77 compiler with maximum level of optimization. The triangle curves show the performance obtained if we call the vendor-optimized BLAS3 library [10] to perform the operations. We can see how in non-rectangular loop nests (STRMM) current compilers achieve poor performance compared with the hand-optimized code provided by the BLAS3 library. By contrast, i...
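The difference between the rectangular and the non-rectangular loop nest described above can be made concrete with a small sketch. It is illustrative only: the function names `gemm` and `trmm_lower` are hypothetical, the triangular matrix is assumed to be lower triangular and applied from the left, and the result is written to a new matrix, whereas the BLAS STRMM routine works in place and supports several other variants.

```python
def gemm(a, b, n):
    """Rectangular loop nest: every loop bound is a constant."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def trmm_lower(a, b, n):
    """Non-rectangular loop nest: the k bound depends on i.

    Assumption: a is lower triangular, so a[i][k] == 0 for k > i and
    those iterations can be skipped.
    """
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(i + 1):
                c[i][j] += a[i][k] * b[k][j]
    return c
```

For a lower triangular `a`, both functions return the same product, but `trmm_lower` visits a triangular rather than a rectangular set of (i, k) pairs, which is the kind of iteration space the passage calls non-rectangular.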


Efficient sparse matrix vector multiplication using compressed graph

Lee

2010

Proceedings of the IEEE SoutheastCon 2010 (SoutheastCon)

