LU decomposition is an important computational step in many engineering and scientific computing problems. In many critical applications, a large number of small-scale problems must be solved rather than a few large linear systems. However, for small and medium-sized matrices, existing batched LU decomposition algorithms are bottlenecked by global memory access latency and perform poorly. We implement a series of specialized, optimized batched GPU LU decomposition algorithms for this regime and, after systematic testing, select the two best-performing variants. Both achieve a speedup of more than 3x over cuBLAS, and more than 10x in some cases.
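To make the problem setting concrete, the following is a minimal CPU-side NumPy sketch of what "batched LU" means: many independent small factorizations performed in one call, vectorized over the batch dimension. This is an illustration only, not the authors' GPU kernels or the cuBLAS implementation; the function name `batched_lu` and all details are our own.

```python
import numpy as np

def batched_lu(A):
    """LU with partial pivoting applied to every matrix in a batch.

    A: array of shape (batch, n, n).
    Returns stacks (P, L, U) such that P @ A == L @ U for each matrix.
    This is a didactic sketch; a real batched GPU kernel would assign
    matrices (or tiles of them) to thread blocks instead.
    """
    A = np.array(A, dtype=np.float64)
    b, n, _ = A.shape
    piv = np.tile(np.arange(n), (b, 1))
    idx = np.arange(b)
    for k in range(n - 1):
        # Partial pivoting: pick the largest entry in column k, per matrix.
        p = k + np.argmax(np.abs(A[:, k:, k]), axis=1)
        for arr in (A, piv):
            tmp = arr[idx, k].copy()
            arr[idx, k] = arr[idx, p]
            arr[idx, p] = tmp
        # One Gaussian elimination step, vectorized over the whole batch.
        A[:, k+1:, k] /= A[:, k, k][:, None]
        A[:, k+1:, k+1:] -= A[:, k+1:, k][:, :, None] * A[:, k, k+1:][:, None, :]
    L = np.tril(A, -1) + np.eye(n)   # unit lower-triangular factors
    U = np.triu(A)                   # upper-triangular factors
    P = np.eye(n)[piv]               # (batch, n, n) permutation matrices
    return P, L, U
```

For many small matrices (e.g. a batch of 10,000 matrices of size 8x8), the dominant cost on a GPU is memory traffic rather than arithmetic, which is the bottleneck the abstract refers to.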