Abstract: The binary32 and binary64 floating-point formats provide good performance on current hardware, but they also introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This paper presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++…
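The rounding error the abstract refers to is the starting point of most multiple-precision schemes: error-free transforms recover, exactly, the part of a result that a single binary64 operation discards. Below is a generic illustration in Python of Knuth's TwoSum, a standard building block of double-double and multi-term formats; it is not code from the library described above, only a sketch of the underlying idea.

```python
def two_sum(a, b):
    """Knuth's TwoSum error-free transform.

    Returns (s, err) such that s + err == a + b exactly,
    where s is the correctly rounded float sum and err is
    the rounding error that the plain addition discarded.
    """
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

# A plain binary64 addition silently loses the small addend:
#   1.0 + 1e-30 == 1.0
# TwoSum returns that lost part in the error term, so a chain of
# such (s, err) pairs can represent the sum to higher precision.
s, err = two_sum(1.0, 1e-30)
```

Multi-term formats such as double-double keep the `(s, err)` pair as an unevaluated sum, which is how libraries like CAMPARY extend precision on hardware that only supports binary64 natively.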
“…We directly use the data reported in work [57] as the GPU results. The work [58] proposes a new multiple-precision algorithm that is faster than CAMPARY when more than 106 bits of precision are required. However, in that work, all libraries are tested in their single-precision versions on the NVIDIA GTX architecture, which is much slower than the double-precision version.…”
Section: Comparisons With Other Architectures
Boson sampling is expected to be one of the important milestones that will demonstrate quantum supremacy. The present work establishes the benchmarking of Gaussian boson sampling (GBS) with threshold detection on the Sunway TaihuLight supercomputer. To achieve the best performance and provide a competitive scenario for future quantum computing studies, the selected simulation algorithm is fully optimized through a set of innovative approaches, including a parallel scheme and an instruction-level optimization method. Furthermore, data precision and instruction scheduling are handled by an adaptive precision optimization scheme and a DAG-based heuristic search algorithm, respectively. Based on these methods, a highly efficient and parallel quantum sampling algorithm is designed. The largest run enables us to obtain one Torontonian function of a 100 × 100 submatrix from 50-photon GBS within 20 hours in 128-bit precision and 2 days in 256-bit precision.
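The Torontonian mentioned above is the matrix function whose evaluation dominates threshold-detection GBS simulation. Its standard inclusion-exclusion form, Tor(O) = Σ_{Z⊆[N]} (−1)^{N−|Z|} / √det(I − O_Z), has 2^N terms, which is what makes a 50-photon run (N = 50 modes, a 100 × 100 matrix) a supercomputer-scale job. The sketch below is a minimal reference implementation under the common xxpp-ordering convention; it is an illustration of the formula, not the optimized Sunway code, and the ordering convention is an assumption.

```python
import itertools
import numpy as np

def torontonian(O):
    """Inclusion-exclusion Torontonian of a 2N x 2N matrix O.

    Assumes xxpp ordering: mode i occupies rows/columns i and i + N,
    so selecting a subset Z of modes keeps indices Z and Z + N.
    """
    N = O.shape[0] // 2
    total = 0.0
    for r in range(N + 1):
        for Z in itertools.combinations(range(N), r):
            if r == 0:
                # Empty submatrix: det(I) = 1, sign (-1)^N.
                total += (-1.0) ** N
                continue
            idx = list(Z) + [i + N for i in Z]
            sub = O[np.ix_(idx, idx)]
            total += (-1.0) ** (N - r) / np.sqrt(
                np.linalg.det(np.eye(2 * r) - sub))
    # Cost: 2**N determinants. For N = 50 that is ~10**15 terms, and the
    # alternating signs are why high precision (128/256-bit) is needed.
    return total
```

The near-cancellation between the 2^N alternating terms is the reason the Sunway work pairs this computation with adaptive precision: in fixed binary64, the small residual sum can be swamped by rounding error.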
“…The authors show up to 19× speedup on a Fermi-based Tesla C2075 GPU over a consumer-grade quad-core Sandy Bridge CPU running MPFR, dropping to ∼1× for 424-bit mantissas. MPRES-BLAS [37] presents GPU acceleration of APFP dense linear algebra, showing ∼2× speedup over CAMPARY for GEMM and reporting ∼100–120 MOp/s at 424-bit precision on a GTX 1080 GPU. Lei et al. [38] implement an APFP accelerator on a Virtex 6 FPGA and report 11.6× speedup for 1024-bit multiplication over MPFR running on a dual-core Core i3 530 Clarkdale-based CPU.…”
Numerical codes that require arbitrary-precision floating-point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the superlinear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8× speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3× speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351 and 191 CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10× speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375 CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, published as an open-source project.
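The recursively defined Karatsuba decomposition at the heart of this design replaces one n-bit multiplication with three multiplications of roughly n/2 bits each, recursing until operands fit the native multiplier width (DSP blocks on the FPGA). The Python sketch below shows the integer-mantissa case in software; the `base_bits` threshold and the software setting are illustrative assumptions, not the paper's fixed-precision hardware pipeline.

```python
def karatsuba(x, y, base_bits=64):
    """Multiply nonnegative integers x and y by recursive Karatsuba.

    Recursion stops once an operand fits in base_bits bits, standing in
    for the native multiplier (a machine word here, a DSP block on FPGA).
    """
    if x < (1 << base_bits) or y < (1 << base_bits):
        return x * y  # base case: native multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> half, x & ((1 << half) - 1)  # x = xh * 2^half + xl
    yh, yl = y >> half, y & ((1 << half) - 1)  # y = yh * 2^half + yl
    a = karatsuba(xh, yh)                      # high product
    c = karatsuba(xl, yl)                      # low product
    # Middle product via one multiply instead of two:
    # (xh + xl)(yh + yl) - a - c == xh*yl + xl*yh
    b = karatsuba(xh + xl, yh + yl) - a - c
    return (a << (2 * half)) + (b << half) + c
```

Three recursive multiplies instead of four gives the O(n^log2(3)) ≈ O(n^1.585) cost that, unrolled at compile-time-fixed precision, maps naturally onto a deep hardware pipeline.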