2020
DOI: 10.25209/2079-3316-2020-11-3-61-84

Multiple-precision matrix-vector multiplication on graphics processing units

Abstract: We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations with multiple-precision vectors and matrices consist of several parts, each of which is calculated by a separate CUDA kernel. This feature eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows the …
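To make the kernel decomposition described in the abstract concrete, below is a minimal CUDA sketch of how an element-wise multiple-precision multiplication can be split into two kernels: a fully data-parallel pass over the RNS digits and a separate pass over signs and exponents, so that the branchy bookkeeping never sits inside the digit-parallel kernel. The mp_array_t layout, the modulus count RNS_MODULI, and the kernel names are illustrative assumptions, not the authors' actual code.

// A minimal sketch (not the paper's implementation) of splitting an
// element-wise multiple-precision product into separate CUDA kernels.

#include <cuda_runtime.h>

#define RNS_MODULI 8                       // number of RNS moduli (assumed)
__constant__ int moduli[RNS_MODULI];       // the moduli m_0 .. m_{k-1}

struct mp_array_t {                        // structure-of-arrays layout (assumed)
    int *digits;                           // n * RNS_MODULI residues
    int *sign;                             // n signs
    int *exp;                              // n exponents
};

// Kernel 1: multiply the RNS digits of each element pairwise.
// One thread per (element, modulus) pair; no divergent branches here.
__global__ void mp_mul_digits(mp_array_t r, mp_array_t x, mp_array_t y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // global digit index
    if (i < n * RNS_MODULI) {
        int m = moduli[i % RNS_MODULI];
        r.digits[i] = (int)(((long long)x.digits[i] * y.digits[i]) % m);
    }
}

// Kernel 2: combine signs and exponents.
// One thread per element; the (cheap) branchy bookkeeping is isolated here,
// so it cannot cause divergence inside the digit-parallel kernel above.
__global__ void mp_mul_signs_exps(mp_array_t r, mp_array_t x, mp_array_t y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        r.sign[i] = x.sign[i] ^ y.sign[i];
        r.exp[i]  = x.exp[i] + y.exp[i];
    }
}

Launching the two kernels back to back over the same arrays reproduces the structure the abstract describes: each kernel executes one homogeneous part of the operation, so threads within a warp follow the same control path.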

Cited by 5 publications (5 citation statements). References 24 publications.
“…The double double arithmetic of CAMPARY performs best for the problem of matrix-vector multiplication. In quad double precision, the authors of [9] write "the CAMPARY library is faster than our implementation; however as the precision increases the execution time of CAMPARY also increases significantly."…”
Section: On Alternatives to CAMPARY
confidence: 94%
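For context, "double double arithmetic" represents a number as an unevaluated sum of two hardware doubles, roughly doubling the significand to about 2 × 53 bits. The sketch below shows the standard error-free transformation (two_sum) and a simplified double-double addition; the names and the simplified renormalization are illustrative and are not CAMPARY's API.

// A brief illustration of double-double arithmetic: hi + lo is an
// unevaluated sum of two doubles; two_sum is the error-free building block.

#include <cstdio>

struct dd { double hi, lo; };

// Error-free addition: hi + lo == a + b exactly, with hi = fl(a + b).
__host__ __device__ inline dd two_sum(double a, double b)
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);   // rounding error of the sum
    return { s, e };
}

// Double-double addition (simplified "sloppy" variant).
__host__ __device__ inline dd dd_add(dd a, dd b)
{
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return two_sum(s.hi, s.lo);           // renormalize into (hi, lo)
}

int main()
{
    dd x = { 1.0, 1e-20 };                 // 1 + 1e-20, representable as a dd
    dd y = dd_add(x, x);
    printf("hi = %.17g, lo = %.17g\n", y.hi, y.lo);   // prints 2 and 2e-20
    return 0;
}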
“…The authors of [9] compare CAMPARY and CUMP [18] to their GPU implementation of multiprecision arithmetic based on the multiple residue number system. The double double arithmetic of CAMPARY performs best for the problem of matrix-vector multiplication.…”
Section: On Alternatives to CAMPARY
confidence: 99%
“…The other drawback is the limited size of the exponents (limited to the 11 bits of the 64-bit hardware doubles), which will prohibit the computation with infinitesimal values. In the context of GPU acceleration, recent work of [16] makes an interesting comparison with double double arithmetic: "The double double arithmetic of CAMPARY performs best for the problem of matrix-vector multiplication." Concerning quad double precision, the authors of [16] write: "the CAMPARY library is faster than our implementation; however as the precision increases the execution time of CAMPARY also increases significantly."…”
Section: Multiprecision Arithmetic
confidence: 99%
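The exponent limitation quoted above can be seen directly: each component of a multiple-double number is an ordinary IEEE double with an 11-bit exponent, so the dynamic range stays near 1e-308 .. 1e+308 no matter how many doubles are chained together, and sufficiently small ("infinitesimal") values underflow to zero. A tiny host-side check, for illustration only:

// Illustration of the fixed exponent range inherited from hardware doubles.

#include <cstdio>
#include <cfloat>

int main()
{
    printf("DBL_MAX_EXP = %d, DBL_MIN = %g\n", DBL_MAX_EXP, DBL_MIN);
    double tiny = 1e-320;                        // already subnormal
    double smaller = tiny * 1e-10;               // below the subnormal range
    printf("1e-320 * 1e-10 = %g\n", smaller);    // prints 0: underflow to zero
    return 0;
}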
“…In the context of GPU acceleration, recent work of [16] makes an interesting comparison with double double arithmetic: "The double double arithmetic of CAMPARY performs best for the problem of matrix-vector multiplication." Concerning quad double precision, the authors of [16] write: "the CAMPARY library is faster than our implementation; however as the precision increases the execution time of CAMPARY also increases significantly." The advantage of multiple double arithmetic is that simple counts of the number of floating-point operations quantify the cost overhead precisely and the flops metrics for performance are directly applicable.…”
Section: Multiprecision Arithmetic
confidence: 99%
“…[11][12][13][14] Because all data information can be expressed in vector units, the matrix computation of data processing can be performed in a highly parallel dot-product manner as vector-matrix multiplication (VMM), which has distinct advantages and is now being developed mainly as an accelerator for inference, especially in neural network systems. [15][16][17][18][19][20][21] In addition, it has the potential to be a reconfigurable analog processor for signal processing, as each variable matrix element can be directly encoded in a matrix array to enable individual input signals to be transformed in a VMM manner. [3][4][5][6] Among big data-driven information applications, controlling traffic flow, especially in urban road networks and in conjunction with autonomous driving technology, is becoming a promising field.…”
Section: Introduction
confidence: 99%