Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs
2020
DOI: 10.1016/j.cam.2019.112701

Cited by 14 publications (5 citation statements)
References 14 publications
“…Combined with optimizations such as software prefetching and parameter tuning, their implementation achieves 1.11× speedup on matrix multiplication compared to cuBLAS. Mukunoki et al [39] evaluated the parallelized linear algebra kernels with multiple data precisions on GPUs. Ryoo et al [45] summarized the general principles of matrix multiplication optimizations on GPU.…”
Section: High Performance Gemm
Citation type: mentioning
confidence: 99%
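The GEMM optimization principles this statement refers to (tiling the matrices, staging tiles in fast on-chip memory, and tuning the tile size) can be illustrated with a minimal CUDA sketch. This is not code from any of the cited works; the kernel name sgemm_tiled, the TILE value, and the assumption of square, tile-divisible, row-major matrices are all illustrative.

```cuda
// Minimal sketch of shared-memory tiling for C = A * B (single precision).
// Assumes square N x N row-major matrices with N divisible by TILE.
#define TILE 16

__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the two tiles entirely out of shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

A launch would use dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE). The software prefetching and parameter tuning mentioned in the quote go beyond this sketch: they overlap the loads for the next tile with computation on the current one and search over tile and register-blocking sizes.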
“…The use of long accumulators provides the replacement of non-associative floating-point operations with fixed-point operations that are associative. The paper [15] presents highly optimized GPU implementations of the DOT, GEMV, GEMM, and SpMV operations, which are included in the BLAS-DOT2 package. In these implementations, internal floating-point operations are performed with at least 2-fold the precision of the input and output data precision, namely, for binary32 data, the computation is performed using the binary64 format, whereas for binary64 data, the computation is performed using the Dot2 algorithm [16], which is based on error-free transformations.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
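The Dot2 algorithm [16] mentioned above evaluates a dot product as if in roughly twice the working precision by pairing each multiply and add with an error-free transformation that captures its rounding error. Below is a minimal sketch of that idea, assuming hardware with a correctly rounded fused multiply-add; the names two_prod, two_sum, and dot2 are illustrative and are not taken from the BLAS-DOT2 package.

```cuda
#include <math.h>

// Error-free product: p + e == a * b exactly, recovered via FMA.
__host__ __device__ static inline void two_prod(double a, double b,
                                                double *p, double *e)
{
    *p = a * b;
    *e = fma(a, b, -(*p));
}

// Error-free sum (branch-free Knuth TwoSum): s + e == a + b exactly.
__host__ __device__ static inline void two_sum(double a, double b,
                                               double *s, double *e)
{
    *s = a + b;
    double z = *s - a;
    *e = (a - (*s - z)) + (b - z);
}

// Dot2-style accumulation: the leading part s and the collected rounding
// errors c together give a result about as accurate as a dot product
// computed in twice the binary64 precision.
__host__ __device__ double dot2(const double *x, const double *y, int n)
{
    double s = 0.0, c = 0.0;          // leading part and accumulated errors
    for (int i = 0; i < n; ++i) {
        double p, ep, es;
        two_prod(x[i], y[i], &p, &ep);
        two_sum(s, p, &s, &es);
        c += ep + es;                 // collect the low-order terms
    }
    return s + c;                     // final correction
}
```

Compared with a plain binary64 dot product, this roughly doubles the floating-point work per element, which is the kind of accuracy versus performance trade-off discussed in the indexed paper.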