Proceedings of the General Purpose GPUs 2017
DOI: 10.1145/3038228.3038237

High-performance Cholesky factorization for GPU-only execution

Abstract: We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures li…

Cited by 12 publications (7 citation statements)
References 21 publications
“…Hybrid CPU-GPU algorithms, for instance, incur memory transfer and synchronization overhead. For smaller batch sizes, GPU-only implementations have been shown to offer better overall energy efficiency on mixed parallel/serial algorithms [Haidar et al 2017]. Likewise, the latency of GPU tasks and data communication has been shown as an important factor affecting hybrid performance [Wong and Aamodt 2009].…”
Section: Background 21 Revisiting Closely-coupled Parallel Accelerators (mentioning)
confidence: 99%
“…platforms has a cost, both in programmability and efficiency. For instance, the energy efficiency benefits of heterogeneous systems are negated due to the communication and synchronization overhead incurred by hybrid algorithms [Haidar et al 2017]. Likewise, the serial performance provided by GPUs demonstrates an impact on overall system performance [Wong and Aamodt 2009].…”
Section: Introduction (mentioning)
confidence: 99%
“…The residual column vector ( 0 ) , treated computationally as a 6 -length real array, is computed based on Eqs. (5), (7) and (8). Tensors 1 and 2 that we obtained in the first stage are required for this procedure.…”
Section: Computing the Residual Column Vector (mentioning)
confidence: 99%
“…Still, the evolution of those algorithms dedicated entirely to GPUs made it possible to envisage efficient implementations. In fact, Haidar et al [8] shows that in modern GPU architectures, GPU-only codes can achieve higher performance than the hybrid algorithms when the difficult-to-parallelize CPU tasks and communications cannot be overlapped entirely by the GPU computations, a typical advantage observed in hybrid implementations. With that in mind, we opt to use a parallel GPU-only LU factorization in our implementation of the scattering algorithm.…”
Section: Using GPU to Solve the Critical Path (mentioning)
confidence: 99%
“…We propose a two-pass RSVD algorithm named block randomized SVD (BRSVD), which accesses the input data only twice in the whole computation. Similar to the GPU-only strategy [21], BRSVD uses GPUs for all computations which fully utilizes the power of accelerators and efficiently processes data without burdening the host CPU. BRSVD decomposes the original power method into independent block executions to reduce access to the target matrix.…”
Section: Introduction (mentioning)
confidence: 99%