1994
DOI: 10.1147/rd.385.0563

Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

Abstract: We describe an algorithms-and-architecture approach to producing high-performance codes for numerically intensive computations. In this approach, for a given computation, we design algorithms so that they perform optimally when run on a target machine, in this case the new POWER2™ machines from the RS/6000 family of RISC processors. The algorithmic features that we emphasize are functional parallelism, cache/register blocking, algorithmic prefetching, loop unrolling, and algorithmic restructuring. The architect…

Cited by 66 publications (46 citation statements)
References 17 publications
“…For computations where data is reused many times, this technique reduces memory traffic to slower memories in the hierarchy [Hennessy and Patterson 2007]. The cache blocking technique has been extensively applied to linear algebra applications [Dongarra et al. 1990; Anderson et al. 1999; Kågström et al. 1998; Gupta et al. 1998; Goto and van de Geijn 2008; Agarwal et al. 1994a]. Since accessing data from a slower memory is expensive, an algorithm that rarely goes to slower memory performs better.…”
Section: Memory Hierarchies
confidence: 99%
“…For the DGEMM routine, we have found that a 4-by-2 unrolling matches well our estimate of the number of loads. Note that this is also the unrolling level used on the IBM POWER2 [1], which ensures that the multiple functional units are fully utilized.…”
Section: Matrix
confidence: 99%
“…The peak floating-point performance of POWER2-based nodes is 266 million operations per second, thanks to two floating-point functional units that can each execute a multiply-add operation in every cycle. The high bandwidth between the register file and the cache, as well as the high bandwidth of the main memory system, enable the nodes to achieve near-peak performance on many dense-matrix operations [Agarwal et al 1994], including all the block operations that our solver uses. SP2 nodes with 128- and 256-bit-wide buses have an even higher main memory bandwidth, which increases the performance of both intraprocessor and interprocessor data transfers.…”
Section: Performance of the Solver
confidence: 99%
“…Communication between the functional units of the same CPU is fast and incurs no overhead. The primitive block operations that our solver uses therefore take advantage of the multiple functional units, so they are parallelized as well by using so-called functional parallelism [Agarwal et al 1994] at the instruction level.…”
Section: Introduction
confidence: 99%