GPU Computing Gems Jade Edition 2012
DOI: 10.1016/b978-0-12-385963-1.00034-4
A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs

Cited by 56 publications (91 citation statements)
References 5 publications
“…Any reduction in the completion time of POTRF tasks allows other cores to resume their execution, thus reducing idle time. The same phenomenon was observed in multi-CPU/multi-GPU factorization [2], [15], [28], where POTRF tasks are inefficient on the GPU; to decrease execution time, tasks belonging to the critical path therefore have to be parallelized. On multi-CPU/multi-GPU systems this was done by executing POTRF tasks on CPUs.…”
Section: Scalability
confidence: 77%
“…The Cholesky factorization (POTRF) decomposes an n × n real symmetric positive definite matrix A into the form A = LL T where L is an n × n real lower triangular matrix with positive diagonal elements [28]. Figure 3 shows the pseudo-code of both the XKaapi and the CilkPlus versions.…”
Section: Cholesky
confidence: 99%
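The factorization described above (A = LL^T with L lower triangular) can be sketched in a few lines. This is a minimal unblocked reference implementation, not the XKaapi or CilkPlus version cited in the snippet; LAPACK's POTRF uses a blocked variant of the same recurrence.

```python
import numpy as np

def potrf_lower(A):
    """Unblocked Cholesky: factor a symmetric positive definite A as L @ L.T,
    with L lower triangular and positive diagonal. Reference sketch only."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # Diagonal entry: square root of the updated pivot.
        L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
        for i in range(j + 1, n):
            # Column j below the diagonal.
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
L = potrf_lower(A)
# L @ L.T reconstructs A.
```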
“…The Cholesky factorization of an n × n matrix consumes about (1/3)n³ FLOPs. Thus, with respect to Equation (2), in double precision, we can expect the right-looking version to have an asymptotic performance upper bound of…”
Section: Algorithm
confidence: 99%
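The (1/3)n³ operation count translates directly into a runtime lower bound once a sustained arithmetic rate is fixed. A small sketch; the 500 GFLOP/s figure is a hypothetical sustained double-precision rate chosen for illustration, not a number from the source.

```python
def cholesky_flops(n):
    # Leading-order operation count for POTRF on an n x n matrix.
    return n ** 3 / 3.0

# Hypothetical sustained double-precision rate (assumption, GFLOP/s).
peak_gflops = 500.0

n = 10_000
# Best-case runtime if every FLOP ran at the sustained rate.
t_lower_bound = cholesky_flops(n) / (peak_gflops * 1e9)
```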
“…The motivation came from the fact that the GPU's compute power cannot be used on a panel factorization as efficiently as it can on trailing matrix updates [39]. As a result, various hybrid algorithms were developed, where the panels are factorized on the CPU while the GPU is used for trailing matrix updates (mostly GEMMs) [2], [14]. For large-enough problems, the panel factorizations and associated CPU-GPU data transfers can be overlapped with GPU work.…”
Section: Related Work
confidence: 99%
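The panel/trailing-update split described above can be illustrated with a right-looking blocked Cholesky. This is a NumPy-only sketch: the comments mark which steps the cited hybrid schemes assign to the CPU (panel work) versus the GPU (GEMM-heavy trailing update); no actual device offload is performed here.

```python
import numpy as np

def blocked_cholesky_sketch(A, nb=128):
    """Right-looking blocked Cholesky, illustrating the hybrid CPU/GPU split.
    Only the lower triangle of the working array is ever read or written."""
    n = A.shape[0]
    L = np.tril(A.copy())
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # "CPU" step: factor the nb x nb diagonal block (POTRF on the panel).
        L[k:k+b, k:k+b] = np.linalg.cholesky(L[k:k+b, k:k+b])
        if k + b < n:
            # "CPU" step: triangular solve for the panel below the diagonal
            # block (TRSM): L21 = A21 * L11^{-T}.
            L[k+b:, k:k+b] = np.linalg.solve(
                L[k:k+b, k:k+b], L[k+b:, k:k+b].T).T
            # "GPU" step: rank-b trailing-matrix update (the GEMM/SYRK bulk):
            # A22 -= L21 * L21^T, applied to the lower triangle only.
            P = L[k+b:, k:k+b]
            L[k+b:, k+b:] -= np.tril(P @ P.T)
    return L
```

The design point the snippet makes is visible in the loop: the trailing update dominates the FLOP count as n grows, so assigning it to the GPU and overlapping the small panel steps on the CPU keeps the accelerator busy.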
“…In this paper, we propose to accelerate the LU factorization on a multicore node enhanced with multiple GPU accelerators. We follow a methodology previously employed in the context of the Cholesky factorization [3] and QR factorization [4] that we apply to the tile LU decomposition algorithm [1]. We bring four contributions.…”
Section: Introduction
confidence: 99%