1994
DOI: 10.1147/rd.385.0563

Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

Abstract: We describe an algorithms-and-architecture approach to producing high-performance codes for numerically intensive computations. In this approach, for a given computation, we design algorithms so that they perform optimally when run on a target machine, in this case the new POWER2™ machines from the RS/6000 family of RISC processors. The algorithmic features that we emphasize are functional parallelism, cache/register blocking, algorithmic prefetching, loop unrolling, and algorithmic restructuring. The architect…

Cited by 66 publications (46 citation statements)
References 17 publications
“…For computations where data is reused many times, this technique reduces memory traffic to slower memories in the hierarchy [Hennessy and Patterson 2007]. The cache blocking technique has been extensively applied to linear algebra applications [Dongarra et al. 1990; Anderson et al. 1999; Kågström et al. 1998; Gupta et al. 1998; Goto and van de Geijn 2008; Agarwal et al. 1994a]. Since accessing data from a slower memory is expensive, an algorithm that rarely goes to slower memory performs better.…”
Section: Memory Hierarchies
confidence: 99%
“…For the DGEMM routine, we have found that a 4-by-2 unrolling matches well our estimate of the number of loads. Note that this is also the unrolling level used on the IBM POWER2 [1], which ensures that the multiple functional units are fully utilized.…”
Section: Matrix
confidence: 99%
“…The peak floating-point performance of POWER2-based nodes is 266 million operations per second, thanks to two floating-point functional units that can each execute a multiply-add operation in every cycle. The high bandwidth between the register file and the cache, as well as the high bandwidth of the main memory system, enable the nodes to achieve near-peak performance on many dense-matrix operations [Agarwal et al 1994], including all the block operations that our solver uses. SP2 nodes with 128- and 256-bit-wide buses have an even higher main memory bandwidth, which increases the performance of both intraprocessor and interprocessor data transfers.…”
Section: Performance of the Solver
confidence: 99%
“…Communication between the functional units of the same CPU is fast and incurs no overhead. The primitive block operations that our solver uses therefore take advantage of the multiple functional units, so they are parallelized as well by using so-called functional parallelism [Agarwal et al 1994] at the instruction level.…”
Section: Introduction
confidence: 99%