The residue number system (RNS) provides parallel, carry-free, high-speed arithmetic and is therefore well suited to high-performance computing. However, operations such as magnitude comparison, sign computation, overflow detection, scaling, and division are difficult to perform in RNS, because determining the magnitude of an RNS number is problematic. To resolve this problem, we propose computing an interval evaluation of the fractional representation of an RNS number in floating-point arithmetic of limited precision. Regardless of the size n of the moduli set and the dynamic range, only small-integer and limited-precision floating-point operations are required, and most of the computations are performed in parallel with n threads, which allows for efficient implementation of our method on many general-purpose computing platforms. Using this method, we propose new algorithms for magnitude comparison and general division in RNS and implement them for GPUs using the CUDA platform. We evaluate the performance of our algorithms on an NVIDIA GTX 1080 GPU using sets of 4 to 256 RNS moduli that provide dynamic ranges from 64 to 4096 bits. Experimental results show that the proposed algorithms are efficient for large moduli sets and clearly outperform existing RNS magnitude comparison and division algorithms in terms of execution time.
INDEX TERMS: Residue number system, floating-point arithmetic, non-modular operations, magnitude comparison, division, high performance, parallel algorithms, graphics processing unit, CUDA.
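To illustrate the fractional representation that the interval evaluation approximates, the following C++ sketch (illustrative only, not the paper's implementation; the moduli, residues, and helper names are chosen for this example) computes the relative value X/M of an RNS number from its residues via the Chinese remainder theorem in double precision and uses it to compare two numbers.

// A minimal, illustrative sketch (not the paper's implementation): by the
// Chinese remainder theorem, an RNS number X with residues x_i and moduli m_i
// satisfies X / M = frac( sum_i |x_i * w_i|_{m_i} / m_i ), where M = m_1*...*m_n,
// M_i = M / m_i, and w_i = |M_i^{-1}|_{m_i}. Comparing two numbers then reduces
// to comparing these fractional values, provided the floating-point estimates
// are accurate enough to separate them.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Modular inverse by exhaustive search; adequate for small illustrative moduli.
uint64_t mod_inverse(uint64_t a, uint64_t m) {
    for (uint64_t x = 1; x < m; ++x)
        if ((a * x) % m == 1) return x;
    return 0;
}

// Approximate the relative value X / M in [0, 1) from the residues of X.
double fractional_value(const std::vector<uint64_t>& x,
                        const std::vector<uint64_t>& m) {
    double s = 0.0;
    for (size_t i = 0; i < m.size(); ++i) {
        uint64_t Mi = 1;                              // (M / m_i) mod m_i
        for (size_t j = 0; j < m.size(); ++j)
            if (j != i) Mi = (Mi * (m[j] % m[i])) % m[i];
        uint64_t w = mod_inverse(Mi, m[i]);           // |M_i^{-1}|_{m_i}
        s += static_cast<double>((x[i] * w) % m[i]) / static_cast<double>(m[i]);
    }
    return s - std::floor(s);                         // keep the fractional part
}

int main() {
    std::vector<uint64_t> m = {7, 11, 13, 15};        // pairwise coprime moduli
    std::vector<uint64_t> a = {3, 5, 9, 11};          // residues of some number A
    std::vector<uint64_t> b = {4, 2, 1, 14};          // residues of some number B
    double fa = fractional_value(a, m);
    double fb = fractional_value(b, m);
    std::cout << (fa < fb ? "A < B" : "A >= B") << '\n';
}

A single double-precision estimate can misorder numbers whose relative values are closer than the rounding error; the paper avoids this by computing lower and upper bounds of the fraction in limited-precision floating-point arithmetic and comparing the resulting intervals.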
The residue number system (RNS), owing to its carry-free nature, is popular in many applications of high-speed computer arithmetic, especially digital signal processing and cryptography. However, the main limiting factor of RNS is the high complexity of operations such as magnitude comparison, sign determination, and overflow detection. These operations have for many years been a major obstacle to more widespread use of parallel residue arithmetic. This paper presents a new, efficient method for performing these operations, based on computing and analyzing an interval estimate of the relative value of an RNS number. The estimate, called the interval floating-point characteristic (IFC), is represented by two directed-rounded bounds that are fixed-precision numbers. In general, the time complexities of serial and parallel computation of the IFC are linear and logarithmic functions of the size of the moduli set, respectively. The new method requires only small-integer and fixed-precision floating-point operations and targets arbitrary moduli sets with large dynamic ranges. Experiments indicate that the performance of the proposed method is significantly higher than that of methods based on mixed-radix conversion.
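The directed-rounding aspect of the IFC can be sketched as follows (a conceptual C++ illustration, not the paper's algorithm; the reduction of the sum modulo 1 and all IFC-specific details are omitted): evaluating the same sum of per-modulus terms once rounded toward minus infinity and once toward plus infinity yields two fixed-precision bounds that enclose the exact value of the sum.

// A conceptual illustration of directed rounding only (not the IFC algorithm
// itself): evaluating the same sum of per-modulus terms under the
// round-toward-minus-infinity and round-toward-plus-infinity modes gives two
// fixed-precision numbers that enclose the exact value of the sum.
// Compile with strict floating-point semantics (e.g. -frounding-math) so that
// the rounding-mode changes are respected.
#include <cfenv>
#include <cstdint>
#include <vector>

// Sum the terms |x_i * w_i|_{m_i} / m_i under the given IEEE 754 rounding mode
// (FE_DOWNWARD or FE_UPWARD).
double bounded_sum(const std::vector<uint64_t>& num,   // numerators |x_i * w_i|_{m_i}
                   const std::vector<uint64_t>& den,   // denominators m_i
                   int rounding_mode) {
    const int saved = std::fegetround();
    std::fesetround(rounding_mode);
    volatile double s = 0.0;          // volatile discourages constant folding
    for (size_t i = 0; i < num.size(); ++i)
        s = s + static_cast<double>(num[i]) / static_cast<double>(den[i]);
    std::fesetround(saved);
    return s;
}

struct Interval { double low, high; };

// If the upper bound of one number lies below the lower bound of another,
// their order is known without reconstructing either number from its residues.
Interval interval_estimate(const std::vector<uint64_t>& num,
                           const std::vector<uint64_t>& den) {
    return { bounded_sum(num, den, FE_DOWNWARD),
             bounded_sum(num, den, FE_UPWARD) };
}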
The residue number system (RNS) is known for its parallel arithmetic and has been used in recent decades in various important applications, from digital signal processing and deep neural networks to cryptography and high-precision computation. However, comparison, sign identification, overflow detection, and division are still hard to implement in RNS. For such operations, most of the methods proposed in the literature support only small dynamic ranges (up to several tens of bits), so they are suitable only for low-precision applications. We recently proposed a method that supports arbitrary moduli sets with cryptographically sized dynamic ranges of up to several thousand bits. The practical advantage of our method over existing methods is that it relies only on very fast standard floating-point operations, so it is suitable for multiple-precision applications and can be efficiently implemented on many general-purpose platforms that support IEEE 754 arithmetic. In this paper, we further improve this method and demonstrate that it can be applied successfully to implement efficient data-parallel primitives operating in the RNS domain, namely finding the maximum element of an array of RNS numbers on graphics processing units. Our experimental results on an NVIDIA RTX 2080 GPU show that for random residues and a 128-moduli set with a 2048-bit dynamic range, the proposed implementation reduces the running time by a factor of 39 and the memory consumption by a factor of 13 compared to an implementation based on mixed-radix conversion.
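Assuming each RNS number in the array has already been mapped to an interval estimate of its relative value (as in the sketches above), the selection step of a maximum search could look like the following sequential C++ sketch; the paper's GPU implementation performs the equivalent comparison inside a parallel reduction, and overlapping intervals require an exact tie-break that is omitted here.

// A sequential sketch of the selection step only. Each RNS number k is
// assumed to have been mapped beforehand to an interval [low_k, high_k]
// enclosing its relative value X_k / M. A challenger replaces the current
// candidate only when its lower bound exceeds the candidate's upper bound;
// overlapping intervals would require an exact tie-break, which is omitted.
#include <cstddef>
#include <vector>

struct Interval { double low, high; };

size_t argmax_rns(const std::vector<Interval>& est) {
    size_t best = 0;
    for (size_t k = 1; k < est.size(); ++k) {
        if (est[k].low > est[best].high) {
            best = k;        // challenger is certainly larger
        } else if (est[best].low > est[k].high) {
            // current candidate is certainly larger: keep it
        } else {
            // intervals overlap: an exact comparison of the two RNS numbers
            // would be needed here (omitted in this sketch)
        }
    }
    return best;             // index of the (tentatively) largest element
}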
We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices are split into several parts, each of which is computed by a separate CUDA kernel. This design eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows full utilization of the GPU's resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to GPU global memory. We have performed a rounding error analysis and derived error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages deployed on GPUs.
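A coalesced access pattern for arrays of residue-based multiple-precision numbers is commonly achieved with a digit-major (struct-of-arrays) layout, so that threads processing consecutive vector elements read consecutive addresses; the following C++ sketch shows one possible indexing scheme and is only an illustration of the idea, not necessarily the paper's actual data structure.

// A hedged sketch of one possible "digit-major" layout for an array of N
// multiple-precision RNS numbers: all residues with respect to modulus 0 are
// stored first, then all residues with respect to modulus 1, and so on.
// Threads that process consecutive elements then touch consecutive addresses,
// which is what makes global-memory accesses coalesced.
#include <cstdint>
#include <vector>

struct RnsArray {
    size_t n_moduli;                 // number of RNS moduli
    size_t n_elements;               // number of multiple-precision entries
    std::vector<uint32_t> residues;  // n_moduli * n_elements values, digit-major

    // Residue of element `elem` with respect to modulus index `mod`.
    uint32_t& at(size_t mod, size_t elem) {
        return residues[mod * n_elements + elem];  // consecutive elem -> consecutive addresses
    }
};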