GPU Computing Gems Jade Edition 2012
DOI: 10.1016/b978-0-12-385963-1.00034-4
A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs

Cited by 56 publications (91 citation statements)
References 5 publications
“…Any reduction in the completion time of POTRF tasks allows other cores to resume their execution, thus reducing idle time. The same phenomenon was observed in multi-CPU/multi-GPU factorization [2], [15], [28], where POTRF tasks are inefficient on the GPU; to decrease execution time, tasks belonging to the critical path therefore have to be parallelized. On multi-CPU/multi-GPU systems this was done by executing POTRF tasks on CPUs.…”
Section: Scalability
confidence: 77%
“…The Cholesky factorization (POTRF) decomposes an n × n real symmetric positive definite matrix A into the form A = LL T where L is an n × n real lower triangular matrix with positive diagonal elements [28]. Figure 3 shows the pseudo-code of both the XKaapi and the CilkPlus versions.…”
Section: Cholesky
confidence: 99%
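The factorization described above (A = LL^T with L lower triangular) can be sketched in a few lines. This is a minimal unblocked reference implementation, not the XKaapi or CilkPlus version cited in the snippet; LAPACK's POTRF uses a blocked variant of the same recurrence.

```python
import numpy as np

def potrf_lower(A):
    """Unblocked Cholesky: factor a symmetric positive definite A as L @ L.T,
    with L lower triangular and positive diagonal. Reference sketch only."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # Diagonal entry: square root of the updated pivot.
        L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
        for i in range(j + 1, n):
            # Column j below the diagonal.
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
L = potrf_lower(A)
# L @ L.T reconstructs A.
```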
“…The Cholesky factorization of an n × n matrix consumes about (1/3)n³ FLOPs. Thus, with respect to Equation (2), in double precision, we can expect the right-looking version to have an asymptotic performance upper bound of…”
Section: Algorithm
confidence: 99%
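The (1/3)n³ operation count translates directly into a runtime lower bound once a sustained arithmetic rate is fixed. A small sketch; the 500 GFLOP/s figure is a hypothetical sustained double-precision rate chosen for illustration, not a number from the source.

```python
def cholesky_flops(n):
    # Leading-order operation count for POTRF on an n x n matrix.
    return n ** 3 / 3.0

# Hypothetical sustained double-precision rate (assumption, GFLOP/s).
peak_gflops = 500.0

n = 10_000
# Best-case runtime if every FLOP ran at the sustained rate.
t_lower_bound = cholesky_flops(n) / (peak_gflops * 1e9)
```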
“…The motivation came from the fact that the GPU's compute power cannot be used on a panel factorization as efficiently as it can on trailing matrix updates [39]. As a result, various hybrid algorithms were developed, where the panels are factorized on the CPU while the GPU is used for trailing matrix updates (mostly GEMMs) [2], [14]. For large-enough problems, the panel factorizations and associated CPU-GPU data transfers can be overlapped with GPU work.…”
Section: Related Work
confidence: 99%
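The panel/trailing-update split described above can be illustrated with a right-looking blocked Cholesky. This is a NumPy-only sketch: the comments mark which steps the cited hybrid schemes assign to the CPU (panel work) versus the GPU (GEMM-heavy trailing update); no actual device offload is performed here.

```python
import numpy as np

def blocked_cholesky_sketch(A, nb=128):
    """Right-looking blocked Cholesky, illustrating the hybrid CPU/GPU split.
    Only the lower triangle of the working array is ever read or written."""
    n = A.shape[0]
    L = np.tril(A.copy())
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # "CPU" step: factor the nb x nb diagonal block (POTRF on the panel).
        L[k:k+b, k:k+b] = np.linalg.cholesky(L[k:k+b, k:k+b])
        if k + b < n:
            # "CPU" step: triangular solve for the panel below the diagonal
            # block (TRSM): L21 = A21 * L11^{-T}.
            L[k+b:, k:k+b] = np.linalg.solve(
                L[k:k+b, k:k+b], L[k+b:, k:k+b].T).T
            # "GPU" step: rank-b trailing-matrix update (the GEMM/SYRK bulk):
            # A22 -= L21 * L21^T, applied to the lower triangle only.
            P = L[k+b:, k:k+b]
            L[k+b:, k+b:] -= np.tril(P @ P.T)
    return L
```

The design point the snippet makes is visible in the loop: the trailing update dominates the FLOP count as n grows, so assigning it to the GPU and overlapping the small panel steps on the CPU keeps the accelerator busy.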
“…In this paper, we propose to accelerate the LU factorization on a multicore node enhanced with multiple GPU accelerators. We follow a methodology previously employed in the context of the Cholesky factorization [3] and QR factorization [4] that we apply to the tile LU decomposition algorithm [1]. We bring four contributions.…”
Section: Introduction
confidence: 99%