2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
DOI: 10.1109/ipdpsw.2012.175
Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs

Abstract: We implemented and evaluated the triple precision Basic Linear Algebra Subprograms (BLAS) subroutines AXPY, GEMV, and GEMM on a Tesla C2050. In this paper, we present a Double+Single (D+S) type triple precision floating-point value format and its operations. They are based on techniques similar to Double-Double (DD) type quadruple precision operations. On the GPU, the D+S-type operations are more costly than the DD-type operations in theory and in practice. Therefore, the triple precision GEMM, which is a compute-…
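
The abstract only outlines the D+S format, but its description (an FP64 high word plus an FP32 low word, with operations built like Double-Double arithmetic) is enough for a minimal host-side sketch. The names ds_t, two_sum, and ds_add below are illustrative, not the paper's API, and this is a plain-C approximation under those assumptions rather than the authors' GPU code:

```c
#include <stdio.h>

/* Assumed D+S ("double + single") value: the represented number is the
 * unevaluated sum hi + lo, an FP64 high word plus an FP32 low word,
 * giving roughly 53 + 24 significand bits (about triple single precision). */
typedef struct { double hi; float lo; } ds_t;

/* Knuth's two-sum: an error-free transformation with s + e == a + b exactly.
 * Compile without value-unsafe optimizations (no -ffast-math). */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* Hypothetical D+S addition in the style of double-double addition,
 * with the low word rounded back to FP32 at the end. */
static ds_t ds_add(ds_t x, ds_t y) {
    double s, e;
    two_sum(x.hi, y.hi, &s, &e);
    e += (double)x.lo + (double)y.lo;  /* fold the FP32 low words into the error */
    two_sum(s, e, &s, &e);             /* renormalize */
    return (ds_t){ s, (float)e };
}

int main(void) {
    ds_t a = { 1.0, 0.0f };
    ds_t b = { 1e-20, 0.0f };
    ds_t c = ds_add(a, b);  /* the 1e-20 survives in the FP32 low word */
    printf("hi = %.17g  lo = %.9g\n", c.hi, (double)c.lo);
    return 0;
}
```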

Cited by 12 publications (5 citation statements). References 8 publications.
“…Triple precision (double + single float) implementations of BLAS routines on GPUs were presented in [16]. Related to polynomial system solving on a GPU, we mention two recent works.…”
Section: Related Work
confidence: 99%
“…Triple precision (double + single float) implementations of BLAS routines on GPUs were presented in [16].…”
Section: Related Work
confidence: 99%
“…One such approach is the double-double and quad-double precision, where a single value is represented as the sum of two and four FP64 values, respectively, and arithmetic operations are performed using a sequence of FP64 operations (Hida et al., 2001). GEMM and other BLAS functions for double-double precision have been evaluated on NVIDIA GPUs and AMD Cypress GPUs (Mukunoki and Takahashi, 2012; Nakasato, 2011). Another approach is the Ozaki scheme, which also splits a value into multiple lower-precision values (Ozaki et al., 2012, 2013).…”
Section: Introduction
confidence: 99%
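
The statement above describes double-double arithmetic only at a high level. As a concrete illustration, here is a minimal double-double addition built from the two-sum error-free transformation, following the general pattern described by Hida et al. (2001) rather than any specific library's API; dd_t and dd_add are illustrative names:

```c
#include <stdio.h>

/* Double-double value: an unevaluated sum hi + lo of two FP64 words,
 * carrying roughly 106 significand bits. */
typedef struct { double hi; double lo; } dd_t;

/* Knuth's two-sum: s + e == a + b exactly (requires strict FP semantics). */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* Double-double addition as a short sequence of FP64 operations:
 * add the high and low words with exact error terms, then renormalize. */
static dd_t dd_add(dd_t x, dd_t y) {
    double s, e, t, f;
    two_sum(x.hi, y.hi, &s, &e);
    two_sum(x.lo, y.lo, &t, &f);
    e += t;
    two_sum(s, e, &s, &e);
    e += f;
    two_sum(s, e, &s, &e);
    return (dd_t){ s, e };
}

int main(void) {
    dd_t a = { 1.0, 1e-30 };
    dd_t b = { -1.0, 0.0 };
    dd_t c = dd_add(a, b);  /* recovers 1e-30, which plain FP64 addition loses */
    printf("%.3g\n", c.hi);
    return 0;
}
```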
“…The suitability of double-double and triple precision Basic Linear Algebra Subroutines (BLAS) was shown in [15,16].…”
Section: Introduction
confidence: 99%