Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs
2020
DOI: 10.1016/j.cam.2019.112701

Cited by 14 publications (5 citation statements)
References 14 publications
“…Combined with optimizations such as software prefetching and parameter tuning, their implementation achieves 1.11× speedup on matrix multiplication compared to cuBLAS. Mukunoki et al [39] evaluated the parallelized linear algebra kernels with multiple data precisions on GPUs. Ryoo et al [45] summarized the general principles of matrix multiplication optimizations on GPU.…”
Section: High Performance Gemm
Citation type: mentioning
confidence: 99%
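The GEMM optimization principles this statement refers to (tiling the matrices, staging tiles in fast on-chip memory, and tuning the tile size) can be illustrated with a minimal CUDA sketch. This is not code from any of the cited works; the kernel name sgemm_tiled, the TILE value, and the assumption of square, tile-divisible, row-major matrices are all illustrative.

```cuda
// Minimal sketch of shared-memory tiling for C = A * B (single precision).
// Assumes square N x N row-major matrices with N divisible by TILE.
#define TILE 16

__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the two tiles entirely out of shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

A launch would use dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE). The software prefetching and parameter tuning mentioned in the quote go beyond this sketch: they overlap the loads for the next tile with computation on the current one and search over tile and register-blocking sizes.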
“…The use of long accumulators provides the replacement of non-associative floating-point operations with fixed-point operations that are associative. The paper [15] presents highly optimized GPU implementations of the DOT, GEMV, GEMM, and SpMV operations, which are included in the BLAS-DOT2 package. In these implementations, internal floating-point operations are performed with at least 2-fold the precision of the input and output data precision, namely, for binary32 data, the computation is performed using the binary64 format, whereas for binary64 data, the computation is performed using the Dot2 algorithm [16], which is based on error-free transformations.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
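The Dot2 algorithm [16] mentioned above evaluates a dot product as if in roughly twice the working precision by pairing each multiply and add with an error-free transformation that captures its rounding error. Below is a minimal sketch of that idea, assuming hardware with a correctly rounded fused multiply-add; the names two_prod, two_sum, and dot2 are illustrative and are not taken from the BLAS-DOT2 package.

```cuda
#include <math.h>

// Error-free product: p + e == a * b exactly, recovered via FMA.
__host__ __device__ static inline void two_prod(double a, double b,
                                                double *p, double *e)
{
    *p = a * b;
    *e = fma(a, b, -(*p));
}

// Error-free sum (branch-free Knuth TwoSum): s + e == a + b exactly.
__host__ __device__ static inline void two_sum(double a, double b,
                                               double *s, double *e)
{
    *s = a + b;
    double z = *s - a;
    *e = (a - (*s - z)) + (b - z);
}

// Dot2-style accumulation: the leading part s and the collected rounding
// errors c together give a result about as accurate as a dot product
// computed in twice the binary64 precision.
__host__ __device__ double dot2(const double *x, const double *y, int n)
{
    double s = 0.0, c = 0.0;          // leading part and accumulated errors
    for (int i = 0; i < n; ++i) {
        double p, ep, es;
        two_prod(x[i], y[i], &p, &ep);
        two_sum(s, p, &s, &es);
        c += ep + es;                 // collect the low-order terms
    }
    return s + c;                     // final correction
}
```

Compared with a plain binary64 dot product, this roughly doubles the floating-point work per element, which is the kind of accuracy versus performance trade-off discussed in the indexed paper.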