2019
DOI: 10.1145/3267101

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Abstract: Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, e…
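For illustration, the following is a minimal sketch of the batched calling convention the abstract describes, using the standard cuBLAS routine cublasDgemmBatched on a batch of tiny matrices. The sizes and the helper code are illustrative assumptions and not the authors' own kernels; data initialization and error checking are omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int n = 16;        // very small matrices, the regime targeted by batched kernels
        const int batch = 1000;  // number of independent problems handled by one call
        const size_t bytes = (size_t)n * n * sizeof(double);

        // One contiguous device buffer per operand (contents left uninitialized here).
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, batch * bytes);
        cudaMalloc((void**)&dB, batch * bytes);
        cudaMalloc((void**)&dC, batch * bytes);

        // Build the per-matrix pointer arrays expected by the batched interface.
        std::vector<const double*> hA(batch), hB(batch);
        std::vector<double*> hC(batch);
        for (int i = 0; i < batch; ++i) {
            hA[i] = dA + (size_t)i * n * n;
            hB[i] = dB + (size_t)i * n * n;
            hC[i] = dC + (size_t)i * n * n;
        }
        const double **dAarr, **dBarr;
        double **dCarr;
        cudaMalloc((void**)&dAarr, batch * sizeof(double*));
        cudaMalloc((void**)&dBarr, batch * sizeof(double*));
        cudaMalloc((void**)&dCarr, batch * sizeof(double*));
        cudaMemcpy(dAarr, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dBarr, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dCarr, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;

        // A single API call performs all `batch` products C_i = A_i * B_i.
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                           &alpha, dAarr, n, dBarr, n, &beta, dCarr, n, batch);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
        return 0;
    }

The per-matrix pointer arrays are what let one call cover the whole batch: the overhead of a thousand separate GEMM launches is replaced by a single batched launch on the library side.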

Cited by 14 publications (9 citation statements).
References 36 publications (32 reference statements).
“…Referred to as batched matrix operations [1,10,15], the idea behind these advanced numerical kernels is to simultaneously execute many linear algebra kernels accessing different matrices so that one may achieve high hardware occupancy. While the support from optimized vendor libraries has improved over the last few years, developers may still have to implement their own kernels (e.g., on GPUs) or simply fall back to the OpenMP for loop pragma to execute kernels in batched mode.…”
Section: Related Work
Mentioning confidence: 99%
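As a concrete picture of the OpenMP fallback mentioned in the statement above, here is a minimal CPU-side sketch, assuming LAPACKE is available; the choice of Cholesky, the column-major layout, and the function name batched_potrf_cpu are illustrative, not taken from the cited works.

    #include <lapacke.h>
    #include <omp.h>
    #include <vector>

    // Factor a batch of small, independent SPD matrices; each vector in `A`
    // holds one n-by-n column-major matrix.
    void batched_potrf_cpu(std::vector<std::vector<double>>& A, int n) {
        #pragma omp parallel for
        for (int i = 0; i < (int)A.size(); ++i) {
            // Each iteration is an independent factorization; the loop itself
            // supplies the batch-level parallelism.
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A[i].data(), n);
        }
    }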
“…Differently from the work presented in this paper, they specialize in triangular matrices. In [7], the authors also adapt this blocking strategy to handle batched operations on small matrix sizes (up to 256) to stress the register usage and maintain data locality. In [28], Peise and Bientinesi introduce ReLAPACK, a collection of recursive algorithms for dense linear algebra.…”
Section: Related Work
Mentioning confidence: 99%
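The recursive formulation referred to above can be sketched for Cholesky factorization as follows; the halving split, the 32x32 base case, and the function name chol_recursive are illustrative assumptions, not the blocking parameters of [7] or ReLAPACK.

    #include <cblas.h>
    #include <cstddef>
    #include <lapacke.h>

    // Recursive Cholesky (lower triangular) of a column-major n-by-n matrix A
    // with leading dimension lda.
    void chol_recursive(double* A, int n, int lda) {
        if (n <= 32) {                              // base case: unblocked/vendor kernel
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);
            return;
        }
        const int n1 = n / 2, n2 = n - n1;
        double* A11 = A;                                 // top-left     n1 x n1
        double* A21 = A + n1;                            // bottom-left  n2 x n1
        double* A22 = A + n1 + (std::size_t)n1 * lda;    // bottom-right n2 x n2

        chol_recursive(A11, n1, lda);               // A11 = L11 * L11^T
        // A21 = A21 * L11^{-T}  (triangular solve from the right)
        cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                    n2, n1, 1.0, A11, lda, A21, lda);
        // A22 = A22 - A21 * A21^T  (symmetric rank-k update of the trailing block)
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                    n2, n1, -1.0, A21, lda, 1.0, A22, lda);
        chol_recursive(A22, n2, lda);               // recurse on the trailing block
    }

Each recursion level pushes most of the work into large TRSM and SYRK calls, which is where the data-reuse benefit of the recursive approach comes from.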
“…Batched GPU routines for LU, Cholesky and QR factorizations have been developed in [5,6,9] using a block-recursive approach which increases data reuse and leads to very good performance for relatively large matrix sizes. GPU routines optimized for computing the QR decomposition of very tall and skinny matrices are presented in [10], where an efficient transpose matrix-vector computation is developed that is employed, with minor changes, in this work.…”
Section: Related Work
Mentioning confidence: 99%
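For reference, here is a hedged sketch of the batched LU calling convention such routines build on, using cuBLAS's existing cublasDgetrfBatched rather than reproducing the block-recursive kernels of [5,6,9] or the transpose matrix-vector routine of [10]; the function name batched_lu and the contiguous-storage assumption are illustrative.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    // dA holds `batch` column-major n-by-n matrices stored back to back on the
    // device; pivots and per-matrix status codes also stay on the device.
    void batched_lu(double* dA, int n, int batch) {
        std::vector<double*> hptr(batch);
        for (int i = 0; i < batch; ++i) hptr[i] = dA + (size_t)i * n * n;

        double** dAarr;
        int *dPiv, *dInfo;
        cudaMalloc((void**)&dAarr, batch * sizeof(double*));
        cudaMalloc((void**)&dPiv,  (size_t)batch * n * sizeof(int));
        cudaMalloc((void**)&dInfo, batch * sizeof(int));
        cudaMemcpy(dAarr, hptr.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        // One call factors every matrix in the batch: P_i * A_i = L_i * U_i.
        cublasDgetrfBatched(handle, n, dAarr, n, dPiv, dInfo, batch);
        cublasDestroy(handle);

        cudaFree(dAarr); cudaFree(dPiv); cudaFree(dInfo);
    }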