High-performance Cholesky factorization for GPU-only execution

Haidar, Azzam; Abdelfattah, Ahmad; Tomov, Stanimire; Dongarra, Jack

doi:10.1145/3038228.3038237

Cited by 12 publications

(7 citation statements)

References 21 publications

“…Hybrid CPU-GPU algorithms, for instance, incur memory transfer and synchronization overhead. For smaller batch sizes, GPU-only implementations have been shown to offer better overall energy efficiency on mixed parallel/serial algorithms [Haidar et al 2017]. Likewise, the latency of GPU tasks and data communication has been shown as an important factor affecting hybrid performance [Wong and Aamodt 2009].…”

Section: Background 21 Revisiting Closely-coupled Parallel Accelerators (mentioning)

confidence: 99%

“…platforms has a cost, both in programmability and efficiency. For instance, the energy efficiency benefits of heterogeneous systems are negated due to the communication and synchronization overhead incurred by hybrid algorithms [Haidar et al 2017]. Likewise, the serial performance provided by GPUs demonstrates an impact on overall system performance [Wong and Aamodt 2009].…”

Section: Introduction (mentioning)

confidence: 99%

SIMT-X. Tino, Collange, Seznec. 2020. ACM Trans. Archit. Code Optim.

This work introduces Single Instruction Multi-Thread Express (SIMT-X), a general-purpose Central Processing Unit (CPU) microarchitecture that enables Graphics Processing Units (GPUs)-style SIMT execution across multiple threads of the same program for high throughput, while retaining the latency benefits of out-of-order execution, and the programming convenience of homogeneous multi-thread processors. SIMT-X leverages the existing Single Instruction Multiple Data (SIMD) back-end to provide CPU/GPU-like processing on a single core with minimal overhead. We demonstrate that although SIMT-X invokes a restricted form of Out-of-Order (OoO), the microarchitecture successfully captures a majority of the benefits of aggressive OoO execution using at most two concurrent register mappings per architectural register, while addressing issues of partial dependencies and supporting a general-purpose Instruction Set Architecture (ISA).

“…The residual column vector ( 0 ) , treated computationally as a 6 -length real array, is computed based on Eqs. (5), (7) and (8). Tensors 1 and 2 that we obtained in the first stage are required for this procedure.…”

Section: Computing the Residual Column Vector (mentioning)

confidence: 99%

“…Still, the evolution of those algorithms dedicated entirely to GPUs made it possible to envisage efficient implementations. In fact, Haidar et al [8] shows that in modern GPU architectures, GPU-only codes can achieve higher performance than the hybrid algorithms when the difficult-to-parallelize CPU tasks and communications cannot be overlapped entirely by the GPU computations, a typical advantage observed in hybrid implementations. With that in mind, we opt to use a parallel GPU-only LU factorization in our implementation of the scattering algorithm.…”

Section: Using GPU To Solve the Critical Path (mentioning)

confidence: 99%

Toward an ultrasonic inspecting method to detect and classify adhesive bonding defects in real time: a numeric study. Ribeiro, Leiderman, Clua. 2020. J Braz. Soc. Mech. Sci. Eng.

Adhesive bonding is an efficient method to join different components in structural design. However, a reliable nondestructive inspecting method to attest the integrity of adhesive bonds is still an open task. In the last few decades, many researchers have put effort into addressing this demand, and the methods based on ultrasound have emerged as the most promising ones. It is consensual that the capability of modeling both mathematically and computationally the interaction between ultrasonic waves and adhesive bonds will play a crucial role in the development of any ultrasonic inspecting method. In that sense, in a previous work, an algorithm to compute the scattering of ultrasonic waves by defective adhesive bonds was developed and implemented. In the present work, we revisit the algorithm and develop a novel GPU parallel implementation, aiming to reduce considerably the execution time. As shown, our new implementation has reduced the execution time by a factor of around 25, opening the possibility for solving the correlated inverse problem in real time. To the best of our knowledge, this is the first time in the literature that GPU is employed to solve this particular ultrasonic scattering problem.

“…We propose a two-pass RSVD algorithm named block randomized SVD (BRSVD), which accesses the input data only twice in the whole computation. Similar to the GPU-only strategy [21], BRSVD uses GPUs for all computations which fully utilizes the power of accelerators and efficiently processes data without burdening the host CPU. BRSVD decomposes the original power method into independent block executions to reduce access to the target matrix.…”

Section: Introduction (mentioning)

confidence: 99%

Block Randomized Singular Value Decomposition on GPUs. Matsushita, Ino. 2020. IEICE Trans. Inf. & Syst.

Fast computation of singular value decomposition (SVD) is of great interest in various machine learning tasks. Recently, SVD methods based on randomized linear algebra have shown significant speedup in this regime. For processing large-scale data, computing systems with accelerators like GPUs have become the mainstream approach. In those systems, access to the input data dominates the overall processing time; therefore, an out-of-core algorithm is needed to dispatch the computation to the accelerators. This paper proposes an accurate two-pass randomized SVD, named block randomized SVD (BRSVD), designed for matrices with a slowly decaying singular spectrum, which is often observed in image data. BRSVD fully utilizes the power of modern computing system architectures and efficiently processes large-scale data in a parallel and out-of-core fashion. Our experiments show that BRSVD effectively moves the performance bottleneck from data transfer to computation, so that it outperforms existing randomized SVD methods in terms of speed while retaining similar accuracy.
