2014
DOI: 10.1007/978-3-642-54420-0_64

Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC

Abstract: This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for internode communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires t…

Cited by 5 publications (5 citation statements). References 21 publications.

“…For a square matrix, our flat-tree configuration obtains the performance that is equivalent to that of our first VSA implementation of the QR decomposition (domino QR) [4]. This domino QR not only obtained significant speedups over the ScaLAPACK implementation of the QR decomposition in the Cray LibSci package, but it also obtained the best performance among hierarchical QR factorizations implemented using another runtime system, PaRSEC [5]. In comparison to the domino QR, our flat-tree QR sends packets between the flat-trees through one level of the binary-tree.…”
Section: Performance Results (mentioning)
confidence: 88%
“…Though the hardware implementations of such systolic arrays had been haunted by an array of problems, the systolic array becomes very attractive as a parallel programming model on modern computers when it is implemented as a software layer like the Virtual Systolic Array (VSA) presented in [4]. Since this discovery, the concepts of 1D, 2D, and 3D systolic arrays as virtualized software designs have been combined with a distributed-memory dataflow runtime and delivered a wide range of scalability results, but with varying levels of achievable performance [5].…”
Section: Introduction (mentioning)
confidence: 99%
“…Comparison of this implementation of QR against the vendor code (LibSci based on ScaLAPACK in this case) is shown in Figure 9 - it is a strong-scaling test. This domino QR not only obtained significant speedups over the ScaLAPACK implementation of the QR decomposition in the Cray LibSci package, but it also obtained the best performance among hierarchical QR factorizations implemented using another runtime system, PaRSEC [5]. In comparison to the domino QR, our flat-tree QR sends packets between the flat-trees through one level of the binary-tree.…”
Section: Performance Results (mentioning)
confidence: 97%
“…Though the hardware implementations of such systolic arrays had been haunted by a number of problems, the systolic array becomes very attractive as a parallel programming model on modern computers when it is implemented as a software layer like the Virtual Systolic Array (VSA) that we presented earlier [4]. Since this discovery, the concepts of 1D, 2D, and 3D systolic arrays as virtualized software designs have been combined with a distributed-memory dataflow runtime and delivered a wide range of scalability results, but with varying levels of achievable performance [5].…”
Section: Introduction (mentioning)
confidence: 99%
“…In Figure 8 we show a strong scaling experiment on a large scale run. The algorithm tested here is the Systolic QR [5] that is implemented in the DPLASMA library. The QR factorization from LibSCI is included in the graph for reference.…”
Section: Performance Experiences With PaRSEC (mentioning)
confidence: 99%