Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures Using Tree Reduction

Ltaief, Hatem; Łuszczek, Piotr; Dongarra, Jack

doi:10.1007/978-3-642-31464-3_67

Cited by 13 publications

(19 citation statements)

References 21 publications


“…Mixed precision techniques are also used for further approximating contributions from farther particles in the context of fast multipole methods as well as LQCD computation [13], resulting in speeding up the computation while reducing memory traffic. Moreover, tree reduction techniques were naturally identified as a way to increase concurrency while reducing data motion on multicore architecture [14]-[16]. In this work, we present detailed power analysis of these mixed precision and tree reduction codes.…”

Section: Related Work (mentioning)

confidence: 99%

Energy Footprint of Advanced Dense Numerical Linear Algebra Using Tile Algorithms on Multicore Architectures

Dongarra

Ltaief

Łuszczek

et al. 2012

2012 Second International Conference on Cloud and Green Computing

Self Cite


We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement allow to run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy and (2) tree reduction technique exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architecture. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the PowerPack framework and the Running Average Power Limit model, respectively.
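The tree reduction idea in the abstract above can be illustrated with a small NumPy sketch. This is not the PLASMA implementation: the function name `tsqr_r`, the `num_blocks` parameter, and the choice of a binary tree are assumptions made for illustration only. The sketch factorizes row blocks of a tall and skinny matrix independently, then merges the resulting R factors pairwise until one remains.

```python
import numpy as np


def tsqr_r(a, num_blocks=4):
    """Return an R factor of a tall-and-skinny matrix using a binary reduction tree.

    Hypothetical helper for illustration; it computes R only, not Q.
    """
    # Leaves of the tree: QR of each row block. These factorizations are
    # independent of each other, which is where the extra parallelism comes from.
    blocks = np.array_split(a, num_blocks, axis=0)
    rs = [np.linalg.qr(block, mode="r") for block in blocks]

    # Inner nodes: stack two R factors and factorize the stack again.
    # Each level halves the number of factors that remain.
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs), 2):
            pair = rs[i:i + 2]
            if len(pair) == 2:
                merged.append(np.linalg.qr(np.vstack(pair), mode="r"))
            else:
                # Odd factor out: carry it unchanged to the next level.
                merged.append(pair[0])
        rs = merged
    return rs[0]
```

For a full-rank input, the result should agree with the R factor from a direct `np.linalg.qr(a, mode="r")` up to the sign of each row, and `r.T @ r` should match `a.T @ a` to rounding error. Here the leaves run sequentially; in a task-based runtime they would be scheduled concurrently.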


“…The two-stage approach was applied to the TRD (Triangular Reduction) [34] and to SVD [35,53,54] in combination with tile algorithms and runtime scheduling based on data dependences between tasks that operate on the tiles. This resulted in very good performance but has never been used to compute the singular vectors.…”

Section: Related Work (mentioning)

confidence: 99%

“…The caveat is that the reductions can be done easily to a band form, instead of the proper bi-diagonal matrix or a tri-diagonal matrix (with a single subdiagonal). The solution is to reduce to the band form first, and then produce the proper form through the process of bulge chasing, i.e., successive elimination of the subdiagonal entries by a series of Householder transformations [34,35,52-55]. Because both the reduction to the band form and the bulge chasing process can be implemented in a parallel and cache-efficient manner, the two-stage procedure is an order of magnitude faster than the legacy approach of LAPACK, which relies heavily on Level 2 BLAS operations, is memory bound, and therefore inefficient.…”

Section: Introduction (mentioning)

confidence: 99%

“…In order to exploit the fine-grained parallelism to its fullest, efficient schedules have to be designed, while data dependencies are preserved, i.e., data hazards are prevented. This has been done for both the simpler single-sided factorizations, such as Cholesky, LU and QR [1, 2, 13, 14, 19-21, 36, 48], as well as the more complicated two-sided factorizations, such as the reductions to band bi-diagonal and band tri-diagonal form [34,35,52-55]. The process of constructing such schedules through manipulation of loop indexes and enforcing them by progress tables is tedious and error-prone.…”

Section: Introduction (mentioning)

confidence: 99%

“…The QR iteration [17,18] is no longer a method of choice for singular vectors because it takes roughly 50% longer than the methods mentioned earlier. The optimization techniques we present here are only applicable to square matrices because a slightly different approach is necessary for the rectangular ones [54].…”

mentioning

confidence: 99%


An improved parallel singular value algorithm and its implementation for multicore hardware

Haidar

Kurzak

Łuszczek

2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Self Cite


The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges, starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All these lead to development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task based approach, and hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.


Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting

Dongarra

Faverge

Ltaief

et al. 2013

Concurrency and Computation

Self Cite


The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. 
For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS.

Successful high-performance results have already been reported for one-sided factorizations (e.g., QR/LQ, LU, and Cholesky factorizations) and, more recently, for the tridiagonal reduction needed to solve the symmetric eigenvalue problems [3] and the bidiagonal reduction required for the singular value decomposition [4-6]. These implementations are based on tile algorithms, which operate on the original matrix using small square regions called tiles; they alleviated the bottlenecks of block algorithms on column-major storage by bringing the parallelism to the fore, minimizing the synchronization overhead, and relying on dynamic scheduling of fine-grained tasks. The data within the tiles can be contiguously stored in memory (i.e., tile data layout) or left as is, following the standard column-major format, which makes the data layout completely indep...
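The tile data layout described in the excerpt above can be sketched in NumPy as follows. The helper names `to_tile_layout` and `from_tile_layout` and the tile size parameter `nb` are hypothetical, and the sketch assumes the matrix dimensions are exact multiples of `nb`. NumPy arrays are row-major by default, whereas the excerpt refers to column-major storage, so this only illustrates the idea of storing each tile contiguously rather than the layout of any particular library.

```python
import numpy as np


def to_tile_layout(a, nb):
    """Copy a 2-D array into a 4-D array whose [i, j] entry is a contiguous nb x nb tile."""
    m, n = a.shape
    if m % nb or n % nb:
        raise ValueError("matrix dimensions must be multiples of the tile size")
    # Split rows and columns into (block index, offset within block), then
    # bring the two block indices to the front so each tile is stored together.
    tiles = a.reshape(m // nb, nb, n // nb, nb).swapaxes(1, 2)
    return np.ascontiguousarray(tiles)


def from_tile_layout(tiles):
    """Inverse of to_tile_layout: reassemble the full matrix from its tiles."""
    mt, nt, nb, _ = tiles.shape
    return tiles.swapaxes(1, 2).reshape(mt * nb, nt * nb)
```

With this layout, `tiles[i, j]` holds the same values as `a[i*nb:(i+1)*nb, j*nb:(j+1)*nb]`, and a kernel working on one tile touches a single contiguous region of memory.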
