2019
DOI: 10.1145/3267101

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Abstract: Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, e…
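For illustration, the following is a minimal sketch of the batched calling convention the abstract describes, using the standard cuBLAS routine cublasDgemmBatched on a batch of tiny matrices. The sizes and the helper code are illustrative assumptions and not the authors' own kernels; data initialization and error checking are omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int n = 16;        // very small matrices, the regime targeted by batched kernels
        const int batch = 1000;  // number of independent problems handled by one call
        const size_t bytes = (size_t)n * n * sizeof(double);

        // One contiguous device buffer per operand (contents left uninitialized here).
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, batch * bytes);
        cudaMalloc((void**)&dB, batch * bytes);
        cudaMalloc((void**)&dC, batch * bytes);

        // Build the per-matrix pointer arrays expected by the batched interface.
        std::vector<const double*> hA(batch), hB(batch);
        std::vector<double*> hC(batch);
        for (int i = 0; i < batch; ++i) {
            hA[i] = dA + (size_t)i * n * n;
            hB[i] = dB + (size_t)i * n * n;
            hC[i] = dC + (size_t)i * n * n;
        }
        const double **dAarr, **dBarr;
        double **dCarr;
        cudaMalloc((void**)&dAarr, batch * sizeof(double*));
        cudaMalloc((void**)&dBarr, batch * sizeof(double*));
        cudaMalloc((void**)&dCarr, batch * sizeof(double*));
        cudaMemcpy(dAarr, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dBarr, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dCarr, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;

        // A single API call performs all `batch` products C_i = A_i * B_i.
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                           &alpha, dAarr, n, dBarr, n, &beta, dCarr, n, batch);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
        return 0;
    }

The per-matrix pointer arrays are what let one call cover the whole batch: the overhead of a thousand separate GEMM launches is replaced by a single batched launch on the library side.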

Cited by 14 publications (9 citation statements).
References 36 publications (32 reference statements).
“…Referred to as batched matrix operations [1,10,15], the idea behind these advanced numerical kernels is to simultaneously execute many linear algebra kernels accessing different matrices so that one may achieve high hardware occupancy. While the support from optimized vendor libraries has improved over the last few years, developers may still have to implement their own kernels (e.g., on GPUs) or simply fall back to the OpenMP for loop pragma to execute kernels in batched mode.…”
Section: Related Work
Mentioning confidence: 99%
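As a concrete picture of the OpenMP fallback mentioned in the statement above, here is a minimal CPU-side sketch, assuming LAPACKE is available; the choice of Cholesky, the column-major layout, and the function name batched_potrf_cpu are illustrative, not taken from the cited works.

    #include <lapacke.h>
    #include <omp.h>
    #include <vector>

    // Factor a batch of small, independent SPD matrices; each vector in `A`
    // holds one n-by-n column-major matrix.
    void batched_potrf_cpu(std::vector<std::vector<double>>& A, int n) {
        #pragma omp parallel for
        for (int i = 0; i < (int)A.size(); ++i) {
            // Each iteration is an independent factorization; the loop itself
            // supplies the batch-level parallelism.
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A[i].data(), n);
        }
    }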
“…Differently from the work presented in this paper, they specialize in triangular matrices. In [7], the authors also adapt this blocking strategy to handle batched operations on small matrix sizes (up to 256) to stress the register usage and maintain data locality. In [28], Peise and Bientinesi introduce ReLAPACK, a collection of recursive algorithms for dense linear algebra.…”
Section: Related Work
Mentioning confidence: 99%
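The recursive formulation referred to above can be sketched for Cholesky factorization as follows; the halving split, the 32x32 base case, and the function name chol_recursive are illustrative assumptions, not the blocking parameters of [7] or ReLAPACK.

    #include <cblas.h>
    #include <cstddef>
    #include <lapacke.h>

    // Recursive Cholesky (lower triangular) of a column-major n-by-n matrix A
    // with leading dimension lda.
    void chol_recursive(double* A, int n, int lda) {
        if (n <= 32) {                              // base case: unblocked/vendor kernel
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);
            return;
        }
        const int n1 = n / 2, n2 = n - n1;
        double* A11 = A;                                 // top-left     n1 x n1
        double* A21 = A + n1;                            // bottom-left  n2 x n1
        double* A22 = A + n1 + (std::size_t)n1 * lda;    // bottom-right n2 x n2

        chol_recursive(A11, n1, lda);               // A11 = L11 * L11^T
        // A21 = A21 * L11^{-T}  (triangular solve from the right)
        cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                    n2, n1, 1.0, A11, lda, A21, lda);
        // A22 = A22 - A21 * A21^T  (symmetric rank-k update of the trailing block)
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                    n2, n1, -1.0, A21, lda, 1.0, A22, lda);
        chol_recursive(A22, n2, lda);               // recurse on the trailing block
    }

Each recursion level pushes most of the work into large TRSM and SYRK calls, which is where the data-reuse benefit of the recursive approach comes from.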
“…Batched GPU routines for LU, Cholesky and QR factorizations have been developed in [5,6,9] using a block-recursive approach which increases data reuse and leads to very good performance for relatively large matrix sizes. GPU routines optimized for computing the QR decomposition of very tall and skinny matrices are presented in [10], where an efficient transpose matrix-vector computation is developed that is employed, with minor changes, in this work.…”
Section: Related Work
Mentioning confidence: 99%
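For reference, here is a hedged sketch of the batched LU calling convention such routines build on, using cuBLAS's existing cublasDgetrfBatched rather than reproducing the block-recursive kernels of [5,6,9] or the transpose matrix-vector routine of [10]; the function name batched_lu and the contiguous-storage assumption are illustrative.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    // dA holds `batch` column-major n-by-n matrices stored back to back on the
    // device; pivots and per-matrix status codes also stay on the device.
    void batched_lu(double* dA, int n, int batch) {
        std::vector<double*> hptr(batch);
        for (int i = 0; i < batch; ++i) hptr[i] = dA + (size_t)i * n * n;

        double** dAarr;
        int *dPiv, *dInfo;
        cudaMalloc((void**)&dAarr, batch * sizeof(double*));
        cudaMalloc((void**)&dPiv,  (size_t)batch * n * sizeof(int));
        cudaMalloc((void**)&dInfo, batch * sizeof(int));
        cudaMemcpy(dAarr, hptr.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        // One call factors every matrix in the batch: P_i * A_i = L_i * U_i.
        cublasDgetrfBatched(handle, n, dAarr, n, dPiv, dInfo, batch);
        cublasDestroy(handle);

        cudaFree(dAarr); cudaFree(dPiv); cudaFree(dInfo);
    }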