This paper proposes a method for implementing dense matrix multiplication in FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA graphics processing units (GPUs). Tensor Cores are specialized processing units that perform 4 × 4 matrix multiplications on FP16 inputs, accumulating and returning the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on an error-free transformation of matrix multiplication. The method has three prominent advantages: first, it can be built upon the cublasGemmEx routine, which uses Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, up to and including the correctly rounded result; third, it ensures bit-level reproducibility even across different numbers of cores and threads. The achievable performance depends on the absolute-value range of the elements of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64-equivalent operation on the Titan RTX GPU (which offers 130 TFlops on Tensor Cores), whereas cublasDgemm achieves only 539 GFlops on the FP64 floating-point units. Our results reveal the possibility of using hardware with limited FP32/FP64 resources but fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.
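To make the splitting idea concrete, the following is a minimal NumPy sketch of the error-free splitting that underlies the Ozaki scheme, not the paper's GPU implementation: each FP64 matrix is split row-wise (column-wise for B) into slices with a reduced significand, so that slice-by-slice GEMMs can be summed accurately. The slice width (bits = 10) and slice count are illustrative assumptions; the actual method derives them from the inner dimension and the FP16/FP32 formats, rescales the slices into FP16 range, and maps each partial product to a cublasGemmEx call with FP32 accumulation on Tensor Cores.

```python
import numpy as np

def split(A, bits=10, num_slices=6):
    """Row-wise error-free split of A into slices holding at most `bits`
    significand bits each (Rump-style extraction with a power-of-two shift)."""
    slices, rest = [], A.astype(np.float64).copy()
    for _ in range(num_slices):
        amax = np.max(np.abs(rest), axis=1, keepdims=True)
        amax[amax == 0] = 1.0                      # avoid log2(0) on all-zero rows
        sigma = 2.0 ** (np.ceil(np.log2(amax)) + 53 - bits)
        hi = (rest + sigma) - sigma                # keeps only the leading bits, exactly
        slices.append(hi)
        rest = rest - hi                           # exact remainder
    return slices

rng = np.random.default_rng(0)
n = 256
A = rng.uniform(-1, 1, (n, n)) * 10.0 ** rng.uniform(0, 9, (n, n))  # wide dynamic range
B = rng.uniform(-1, 1, (n, n)) * 10.0 ** rng.uniform(0, 9, (n, n))

A_slices = split(A)
B_slices = [s.T for s in split(B.T)]               # split B column-wise

# Each partial product is exact when 2*bits + log2(n) fits in the accumulator;
# on the GPU these GEMMs would be cublasGemmEx calls on FP16 slices with FP32
# accumulation, and more slices buy more accuracy at the cost of more GEMMs.
C = np.zeros((n, n))
for Ai in A_slices:
    for Bj in B_slices:
        C += Ai @ Bj

print("max relative difference vs. np.matmul:",
      np.max(np.abs(C - A @ B) / np.abs(A @ B)))
```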
Keywords: Tensor cores • FP16 • Half-precision • Low-precision • Matrix multiplication • GEMM • Linear algebra • Accuracy • Reproducibility
We implemented and evaluated triple-precision Basic Linear Algebra Subprograms (BLAS) routines, AXPY, GEMV, and GEMM, on a Tesla C2050. In this paper, we present a Double+Single (D+S) triple-precision floating-point value format and its operations, which are based on techniques similar to Double-Double (DD) quadruple-precision operations; a sketch of the idea follows this abstract. On the GPU, the D+S-type operations are more costly than the DD-type operations, both in theory and in practice; therefore the triple-precision GEMM, which is compute-bound, is slower than the quadruple-precision GEMM. However, the triple-precision AXPY and GEMV are memory-bound on the GPU, so their execution time is close to 3/4 that of the corresponding quadruple-precision subroutines. We therefore conclude that the triple-precision format is useful for memory-bound operations in cases where quadruple precision is not required but double precision is not sufficient.
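As a rough illustration of the D+S idea (a sketch under our own assumptions, not the paper's GPU kernels), a triple-precision value can be held as an FP64 high part plus an FP32 low part, and addition can reuse the TwoSum error-free transformation that also underlies DD arithmetic:

```python
import numpy as np

def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with a + b == s + e exactly in FP64."""
    s = a + b
    bv = s - a
    av = s - bv
    return s, (a - av) + (b - bv)

class DS:
    """Triple-precision value stored as an FP64 high part plus an FP32 low part."""
    def __init__(self, hi=0.0, lo=0.0):
        self.hi = np.float64(hi)
        self.lo = np.float32(lo)

    def __add__(self, other):
        s, e = two_sum(self.hi, other.hi)              # error-free sum of high parts
        e = e + np.float64(self.lo) + np.float64(other.lo)
        s, e = two_sum(s, e)                           # renormalize
        return DS(s, e)

# the FP32 low part carries roughly 24 extra bits beyond the FP64 high part
x = DS(1.0, 1e-20)
y = DS(1e-16)
z = x + y
print(z.hi, float(z.lo))
```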
Numerical validation makes it possible to ensure the reliability of computations that rely on floating-point operations. Discrete Stochastic Arithmetic (DSA) validates the accuracy of floating-point computations using random rounding; however, it may incur a large performance overhead compared with standard floating-point operations. In this article, we show that, with perturbed data, standard floating-point arithmetic can be used instead of DSA for the purpose of numerical validation. For instance, for codes that include matrix multiplications, we can directly use the level-3 BLAS matrix multiplication routine (GEMM) performed with standard floating-point arithmetic. Consequently, we achieve a significant performance improvement by avoiding the overhead of DSA operations and by exploiting the speed of highly optimized BLAS implementations. Finally, we demonstrate the performance gain of Intel MKL routines compared against the DSA versions of the BLAS routines.
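The following sketch illustrates the approach described above under simplified assumptions (it is not CADNA or any particular DSA library): a few randomly perturbed copies of the inputs are pushed through the ordinary, highly optimized GEMM, and the number of common significant digits is estimated from the spread of the results.

```python
import numpy as np

def perturb(X, rng, ulps=1.0):
    """Randomly perturb each element by about `ulps` units in the last place."""
    return X * (1.0 + ulps * np.finfo(np.float64).eps * rng.uniform(-1.0, 1.0, X.shape))

def significant_digits(samples):
    """DSA-style (simplified) estimate of the common decimal digits across samples."""
    mean = np.mean(samples, axis=0)
    spread = np.std(samples, axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        digits = np.log10(np.abs(mean) / spread)
    # samples that agree exactly (spread == 0) count as fully significant
    return np.where(np.isfinite(digits), np.clip(digits, 0.0, 16.0), 16.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))

# three perturbed runs through the ordinary, highly optimized GEMM
samples = np.stack([perturb(A, rng) @ perturb(B, rng) for _ in range(3)])
d = significant_digits(samples)
print("estimated significant digits: min =", d.min(), " median =", np.median(d))
```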