2018
DOI: 10.1016/j.parco.2017.09.001

Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression

Abstract: We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used for its simplicity and inherent parallelism as a building block for the SVD of low rank blocks using randomized methods. We implement multiple kernels based on the level of the GPU memory hierarchy in which the matrices can reside and show substantial speedups against stream…
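As background for the abstract, the sketch below is a minimal single-matrix NumPy version of the one-sided Jacobi SVD: column pairs are rotated until mutually orthogonal, after which the singular values are the column norms. It only illustrates the building block the abstract names; the batching, GPU memory-hierarchy placement, and randomized low-rank handling that the paper contributes are omitted, and the function name and tolerances are illustrative assumptions.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Thin SVD A = U @ diag(s) @ V.T via one-sided Jacobi rotations.

    Column pairs of A are rotated until mutually orthogonal; the
    singular values are then the column norms (unsorted on return).
    Illustrative single-matrix sketch, not the paper's batched kernel.
    """
    A = np.array(A, dtype=np.float64)
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                a = A[:, i] @ A[:, i]   # ||A_i||^2
                b = A[:, j] @ A[:, j]   # ||A_j||^2
                c = A[:, i] @ A[:, j]   # A_i . A_j
                if abs(c) <= tol * np.sqrt(a * b):
                    continue            # pair already orthogonal enough
                converged = False
                # Givens rotation zeroing the (i, j) entry of A.T @ A
                zeta = (b - a) / (2.0 * c)
                sgn = 1.0 if zeta >= 0.0 else -1.0
                t = sgn / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / np.sqrt(1.0 + t * t)
                sn = cs * t
                # Apply the rotation to A's columns and accumulate it in V
                Ai, Aj = A[:, i].copy(), A[:, j].copy()
                A[:, i], A[:, j] = cs * Ai - sn * Aj, sn * Ai + cs * Aj
                Vi, Vj = V[:, i].copy(), V[:, j].copy()
                V[:, i], V[:, j] = cs * Vi - sn * Vj, sn * Vi + cs * Vj
        if converged:
            break
    s = np.linalg.norm(A, axis=0)          # singular values = column norms
    U = A / np.where(s > 0.0, s, 1.0)      # left singular vectors
    return U, s, V
```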


Cited by 45 publications (23 citation statements)
References 23 publications
“…Performance is obtained because the large amount of compute-intensive factorizations, both QR and SVD, that are performed at every level can be efficiently executed by batched kernels. We have developed batched QR and batched adaptive randomized SVD operations for this purpose [26,27].…”
Section: (B) Linear Algebra Operations With Hierarchical Matrices
Mentioning confidence: 99%
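The batched adaptive randomized SVD referenced in the statement above follows the standard randomized range-finder pattern (Halko et al.). The sketch below shows that structure for a single matrix with a fixed target rank; it is a stand-in for, not a reproduction of, the adaptive batched GPU kernels of [26,27], and the function name and oversampling parameter p are assumptions.

```python
import numpy as np

def randomized_svd(A, k, p=8, rng=None):
    """Rank-k truncated SVD via a Gaussian randomized range finder.

    Single-matrix, fixed-rank illustration of the pattern behind the
    paper's batched adaptive randomized SVD; p is oversampling.
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape
    # Sample the range of A with a Gaussian test matrix
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ Omega)       # orthonormal basis for the sample
    B = Q.T @ A                          # small (k+p) x n projection
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                           # lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]
```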
“…If additional rank reduction is required (to satisfy some fixed error threshold), an approximate singular value decomposition can be easily obtained from this low rank form by computing the QR decomposition of B and then the SVD of the small k × k triangular factor. This method has been implemented as a batched GPU routine and used to accelerate the compression of hierarchical matrices [7] in place of the full singular value decomposition.…”
Section: Notation and Definitions
Mentioning confidence: 99%
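A hedged sketch of the recompression step this statement describes: assuming a low-rank form A ≈ U Bᵀ with U having orthonormal columns, a QR factorization of B reduces the problem to an SVD of a small k × k triangular factor. The factor layout, function name, and relative truncation rule below are illustrative assumptions, not the cited routine's interface.

```python
import numpy as np

def lowrank_to_svd(U, B, tol=1e-8):
    """Turn a low-rank form A ~= U @ B.T (U: m x k with orthonormal
    columns, B: n x k) into a truncated SVD A ~= Uk @ diag(sk) @ Vk.T,
    touching only B's small k x k triangular QR factor."""
    Qb, Rb = np.linalg.qr(B)                 # B = Qb @ Rb, Rb is k x k
    W, s, Zt = np.linalg.svd(Rb.T)           # small k x k dense SVD
    # Keep singular values above a relative threshold (assumed rule)
    r = max(1, int(np.sum(s > tol * s[0])))
    Uk = U @ W[:, :r]                        # rotate the left factor
    Vk = Qb @ Zt[:r, :].T                    # rotate the right factor
    return Uk, s[:r], Vk
```

Since A = U Rbᵀ Qbᵀ and Rbᵀ = W diag(s) Zᵀ, the product (U W) diag(s) (Qb Z)ᵀ is an exact SVD of the low-rank form before truncation; all expensive work is confined to the k × k factor.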
“…In [Abdelfattah et al. 2016a,b,d], a newer implementation in MAGMA is proposed for handling batched matrix factorizations with variable sizes, which has also been of interest in the context of accelerating sparse linear algebra [SuiteSparse 2017] during the Schur complement calculations. More recently, some of the authors have proposed new batched QR and SVD kernels for very small matrix sizes with applications in the compression of hierarchical matrices [Boukaram et al. 2017].…”
Section: Related Work
Mentioning confidence: 99%