Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

Song, Fengguang; Ltaief, Hatem; Hadri, Bilel; Dongarra, Jack

doi:10.1109/sc.2010.48

Cited by 25 publications

(20 citation statements)

References 11 publications

“…And the same apply equally to GPU-accelerated implementations [2] as well as the codes designed specifically for distributed memory clusters of multicore nodes [20].…”

Section: Related Work and Relevant Contributions (mentioning)

confidence: 93%

Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures Using Tree Reduction

Ltaief; Łuszczek; Dongarra (2012)

Parallel Processing and Applied Mathematics

Self Cite

Abstract. The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et al. [LAPACK Working Note #247, 2011], the bidiagonal transformation using tile algorithms with a two-stage approach has shown very promising results on square matrices. However, for tall and skinny matrices, the inherent problem of processing the panel in a domino-like fashion generates unnecessary sequential tasks. By using tree reduction, the panel is horizontally split, which creates another dimension of parallelism and engenders many concurrent tasks to be dynamically scheduled on the available cores. The results reported in this paper are very encouraging. The new tile bidiagonal transformation, targeting tall and skinny matrices, outperforms the state-of-the-art numerical linear algebra libraries LAPACK V3.2 and Intel MKL ver. 10.3 by up to 29-fold speedup and the standard two-stage PLASMA BRD by up to 20-fold speedup, on an eight socket hexa-core AMD Opteron multicore shared-memory system.

“…First, several local binary trees are applied in parallel, one within each node, and then a global binary tree is applied for the final reduction across nodes. Yet another implementation [19] also uses a hierarchical approach, and it also uses a 1D block distribution. The main difference is that the first level of reduction is performed with a flat tree within each node.…”

Section: Related Work (mentioning)

confidence: 99%

“…The main difference is that the first level of reduction is performed with a flat tree within each node. Note that the hierarchical algorithm (HQR) used previously [5] can be parametrized to implement this original algorithm [19] as well as a more efficient variant with cyclic layout. The HQR algorithm [5] is the reference algorithm for multilevel clusters: it provides a flexible approach, and allows one to use various elimination trees (Flat, Binary, Fibonacci or Greedy) at each level.…”

Section: Related Work (mentioning)

confidence: 99%

Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC

Aupy; Faverge; Robert; et al. (2014)

Euro-Par 2013: Parallel Processing Workshops

Self Cite

This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for internode communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high-level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
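The citation statements above contrast elimination trees for the panel reduction: a flat tree within each node followed by a binary tree across nodes, versus binary trees at both levels. A minimal sketch of those schedules is given below. It is an illustration only, not the implementation of any cited paper; the function names, the 1D block layout of tile rows, and the representation of a schedule as rounds of (pivot, eliminated) pairs are all assumptions made for this example.

```python
from typing import List, Tuple

# (pivot tile row, eliminated tile row); hypothetical representation
Elimination = Tuple[int, int]


def flat_tree(rows: List[int]) -> List[List[Elimination]]:
    """Flat tree: the first row eliminates every other row, one per step."""
    pivot = rows[0]
    return [[(pivot, r)] for r in rows[1:]]


def binary_tree(rows: List[int]) -> List[List[Elimination]]:
    """Binary tree: surviving rows are paired each round; pairs are independent."""
    rounds = []
    alive = list(rows)
    while len(alive) > 1:
        pairs = [(alive[i], alive[i + 1]) for i in range(0, len(alive) - 1, 2)]
        rounds.append(pairs)
        alive = alive[::2]  # the pivot of each pair survives
    return rounds


def hierarchical(num_nodes: int, rows_per_node: int) -> List[List[Elimination]]:
    """Flat tree inside each node (1D block layout), then a binary tree across nodes."""
    blocks = [
        list(range(n * rows_per_node, (n + 1) * rows_per_node))
        for n in range(num_nodes)
    ]
    local = [flat_tree(b) for b in blocks]
    # Step k of every node's flat tree can run concurrently with the others.
    local_rounds = [
        [pair for node in local for pair in node[k]]
        for k in range(rows_per_node - 1)
    ]
    heads = [b[0] for b in blocks]  # one surviving tile row per node
    return local_rounds + binary_tree(heads)


if __name__ == "__main__":
    for i, rnd in enumerate(hierarchical(num_nodes=4, rows_per_node=3)):
        print(f"round {i}: {rnd}")
```

In this toy model, replacing `flat_tree` with `binary_tree` for the local level would give the variant with binary trees at both levels; the number of rounds is only a rough proxy for the critical path and ignores kernel costs and communication.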

“…This section briefly introduces the previous work of distributed tiled CAQR [9] and its existing performance bottleneck.…”

Section: Introduction (mentioning)

confidence: 99%

suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework

Zheng; Song; Lin (2016)

2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)

Self Cite

Abstract. The scope of this paper is to design and implement a scalable QR factorization solver that can deliver the fastest performance for tall and skinny matrices and square matrices on modern supercomputers. The new solver, named scalable universal communication-avoiding QR factorization (suCAQR), introduces a simplified and tuning-less way to realize the communication-avoiding QR factorization algorithm to support matrices of any shape. The software design includes a mixed usage of physical and logical data layouts, a simplified method of dynamic-root binary-tree reduction, a dynamic dataflow implementation, and an analytical model to determine an optimal number of factorization domains. Compared with the existing communication-avoiding QR factorization implementations, suCAQR has the benefits of being simpler, more general, and more efficient. By balancing the degree of parallelism and the proportion of faster computational kernels, it is able to achieve scalable performance on clusters of multicore nodes. The software essentially combines the strengths of both the synchronization-reducing approach and the communication-avoiding approach to achieve high performance. Based on the experimental results using 1,024 CPU cores, suCAQR is faster than DPLASMA by up to 30%, and faster than ScaLAPACK by up to 30 times. Index Terms: high performance computing; computational science application; performance modeling and optimization; dataflow runtime system.
