Abstract: The binary32 and binary64 floating-point formats provide good performance on current hardware, but they also introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This paper presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++…
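The rounding error the abstract refers to is the starting point of most multiple-precision schemes: error-free transforms recover, exactly, the part of a result that a single binary64 operation discards. Below is a generic illustration in Python of Knuth's TwoSum, a standard building block of double-double and multi-term formats; it is not code from the library described above, only a sketch of the underlying idea.

```python
def two_sum(a, b):
    """Knuth's TwoSum error-free transform.

    Returns (s, err) such that s + err == a + b exactly,
    where s is the correctly rounded float sum and err is
    the rounding error that the plain addition discarded.
    """
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

# A plain binary64 addition silently loses the small addend:
#   1.0 + 1e-30 == 1.0
# TwoSum returns that lost part in the error term, so a chain of
# such (s, err) pairs can represent the sum to higher precision.
s, err = two_sum(1.0, 1e-30)
```

Multi-term formats such as double-double keep the `(s, err)` pair as an unevaluated sum, which is how libraries like CAMPARY extend precision on hardware that only supports binary64 natively.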
“…We directly use the data reported in work [57] as the GPU results. The work [58] proposes a new multiple-precision algorithm that is faster than CAMPARY when more than 106 bits of precision are required. However, in that work, all libraries are tested in their single-precision versions on the NVIDIA GTX architecture, which is much slower than the double-precision version.…”
Section: Comparisons With Other Architectures
Boson sampling is expected to be one of the important milestones that will demonstrate quantum supremacy. The present work establishes the benchmarking of Gaussian boson sampling (GBS) with threshold detection on the Sunway TaihuLight supercomputer. To achieve the best performance and provide a competitive scenario for future quantum computing studies, the selected simulation algorithm is fully optimized through a set of innovative approaches, including a parallel scheme and an instruction-level optimization method. Furthermore, data precision and instruction scheduling are handled by an adaptive precision optimization scheme and a DAG-based heuristic search algorithm, respectively. Based on these methods, a highly efficient and parallel quantum sampling algorithm is designed. The largest run enables us to obtain one Torontonian function of a 100 × 100 submatrix from 50-photon GBS within 20 hours in 128-bit precision and 2 days in 256-bit precision.
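The Torontonian mentioned above is the matrix function whose evaluation dominates threshold-detection GBS simulation. Its standard inclusion-exclusion form, Tor(O) = Σ_{Z⊆[N]} (−1)^{N−|Z|} / √det(I − O_Z), has 2^N terms, which is what makes a 50-photon run (N = 50 modes, a 100 × 100 matrix) a supercomputer-scale job. The sketch below is a minimal reference implementation under the common xxpp-ordering convention; it is an illustration of the formula, not the optimized Sunway code, and the ordering convention is an assumption.

```python
import itertools
import numpy as np

def torontonian(O):
    """Inclusion-exclusion Torontonian of a 2N x 2N matrix O.

    Assumes xxpp ordering: mode i occupies rows/columns i and i + N,
    so selecting a subset Z of modes keeps indices Z and Z + N.
    """
    N = O.shape[0] // 2
    total = 0.0
    for r in range(N + 1):
        for Z in itertools.combinations(range(N), r):
            if r == 0:
                # Empty submatrix: det(I) = 1, sign (-1)^N.
                total += (-1.0) ** N
                continue
            idx = list(Z) + [i + N for i in Z]
            sub = O[np.ix_(idx, idx)]
            total += (-1.0) ** (N - r) / np.sqrt(
                np.linalg.det(np.eye(2 * r) - sub))
    # Cost: 2**N determinants. For N = 50 that is ~10**15 terms, and the
    # alternating signs are why high precision (128/256-bit) is needed.
    return total
```

The near-cancellation between the 2^N alternating terms is the reason the Sunway work pairs this computation with adaptive precision: in fixed binary64, the small residual sum can be swamped by rounding error.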
“…The authors show up to 19× speedup on a Fermi-based Tesla C2075 GPU over a consumer-grade quad-core Sandy Bridge CPU running MPFR, dropping to ∼1× for 424-bit mantissas. MPRES-BLAS [37] presents GPU acceleration of APFP dense linear algebra, showing ∼2× speedup over CAMPARY for GEMM and reporting ∼100–120 MOp/s at 424-bit precision on a GTX 1080 GPU. Lei et al. [38] implement an APFP accelerator on a Virtex 6 FPGA and report 11.6× speedup for 1024-bit multiplication over MPFR running on a dual-core Core i3 530 Clarkdale-based CPU.…”
Numerical codes that require arbitrary-precision floating-point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the superlinear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8× speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3× speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351 and 191 CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10× speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375 CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, published as an open-source project.
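The recursively defined Karatsuba decomposition at the heart of this design replaces one n-bit multiplication with three multiplications of roughly n/2 bits each, recursing until operands fit the native multiplier width (DSP blocks on the FPGA). The Python sketch below shows the integer-mantissa case in software; the `base_bits` threshold and the software setting are illustrative assumptions, not the paper's fixed-precision hardware pipeline.

```python
def karatsuba(x, y, base_bits=64):
    """Multiply nonnegative integers x and y by recursive Karatsuba.

    Recursion stops once an operand fits in base_bits bits, standing in
    for the native multiplier (a machine word here, a DSP block on FPGA).
    """
    if x < (1 << base_bits) or y < (1 << base_bits):
        return x * y  # base case: native multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> half, x & ((1 << half) - 1)  # x = xh * 2^half + xl
    yh, yl = y >> half, y & ((1 << half) - 1)  # y = yh * 2^half + yl
    a = karatsuba(xh, yh)                      # high product
    c = karatsuba(xl, yl)                      # low product
    # Middle product via one multiply instead of two:
    # (xh + xl)(yh + yl) - a - c == xh*yl + xl*yh
    b = karatsuba(xh + xl, yh + yl) - a - c
    return (a << (2 * half)) + (b << half) + c
```

Three recursive multiplies instead of four gives the O(n^log2(3)) ≈ O(n^1.585) cost that, unrolled at compile-time-fixed precision, maps naturally onto a deep hardware pipeline.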