Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

Kim, Sangpyo; Jung, Wonkyung; Park, Jaiyoung; Ahn, Jung Ho

doi:10.1109/iiswc50251.2020.00033

Cited by 44 publications

(30 citation statements)

References 28 publications


“…These show why the use of AVX-512 achieves an insufficient degree of performance improvement and also imply that we can substantially accelerate HE Mul by natively supporting these SIMD instructions. Previous work [49], [53] also showed that SIMD is effective in accelerating NTT and iNTT on CPUs and GPUs. Impact of Q on the characteristics of HE Mul: Q determines multiplicative depth L; a larger depth requires a bigger Q.…”

Section: Discussion and Related Work (mentioning)

confidence: 94%


Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization

et al. 2021

Self Cite


Homomorphic Encryption (HE) draws significant attention as a privacy-preserving way for cloud computing because it allows computation on encrypted messages called ciphertexts. Among numerous HE schemes proposed, HE for Arithmetic of Approximate Numbers (HEAAN) is rapidly gaining popularity across a wide range of applications as it supports messages that can tolerate approximate computation with no limit on the number of arithmetic operations applicable to the ciphertexts. A critical shortcoming of HE is the high computation complexity of ciphertext arithmetic; especially, HE multiplication (HE Mul) is more than 10,000 times slower than the corresponding multiplication between unencrypted messages. This leads to a large body of HE acceleration studies including ones exploiting FPGAs; however, those did not conduct a rigorous analysis of computational complexity and data access patterns of HE Mul. Moreover, the proposals mostly focused on designs with small parameter sizes, making it difficult to accurately estimate the performance of the HE accelerators in conducting a series of complex arithmetic operations. In this paper, we first describe how HE Mul of HEAAN is performed in a manner friendly to non-crypto experts. Then we conduct a disciplined analysis on its computational and memory-access characteristics, through which we (1) extract parallelism in the key functions composing HE Mul and (2) demonstrate how to effectively map the parallelism to the popular parallel processing platforms, CPUs and GPUs, by applying a series of optimizations such as transposing matrices and pinning data to threads. This leads to the performance improvement of HE Mul on a CPU and a GPU by 2.06× and 4.05×, respectively, over the reference HEAAN running on a CPU with 24 threads.

Index terms: Computer applications, Computer architecture, Cryptography, Multicore processing



“…For NTT and iNTT, [49] characterizes various NTT implementations, including the high-radix approach in this paper, and suggests on-the-fly twiddle factor generation. Another approach [10] is to exploit Discrete Galois Transform (DGT)…”

Section: Discussion and Related Work (mentioning)

confidence: 99%
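The statement above mentions on-the-fly twiddle factor generation without saying how it is done. One common way to avoid loading a full table of N twiddle factors, assumed here purely for illustration, is to keep two small tables and rebuild each factor with one extra modular multiplication. The names `build_tables` and `twiddle`, the split parameter `base`, and the toy values (p = 17, w = 3, n = 16) are hypothetical and are not taken from the cited papers.

```python
def build_tables(w, p, n, base):
    """Precompute two small tables instead of all n powers of w.

    lo[i] = w^i          for 0 <= i < base
    hi[i] = w^(base * i) for 0 <= i < ceil(n / base)
    """
    lo = [pow(w, i, p) for i in range(base)]
    hi = [pow(w, base * i, p) for i in range((n + base - 1) // base)]
    return lo, hi


def twiddle(e, lo, hi, p, base):
    """Rebuild w^e from the two tables: w^e = w^(base * (e // base)) * w^(e % base)."""
    return hi[e // base] * lo[e % base] % p


if __name__ == "__main__":
    # Toy parameters (assumption): 3 has multiplicative order 16 modulo 17.
    p, w, n, base = 17, 3, 16, 4
    lo, hi = build_tables(w, p, n, base)
    print([twiddle(e, lo, hi, p, base) for e in range(n)])
```

With `base` close to the square root of n, the two tables together hold about 2·√n entries instead of n, which trades one extra multiplication per twiddle factor for a much smaller memory footprint.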

Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization

et al. 2021

Self Cite


“…For the CPU implementation of NTT, we use the same approach as in the RNS operation, where each thread takes N residues (i.e., a^(i) for a given i) at a time, and perform N-point NTT. For the GPU implementation, we use the hierarchical NTT implementation [KJPA20], which heavily exploits shared memory in GPUs while adopting an earlier approach in [GLD+08]. Specifically, for every (i)NTT with N residues, we use 8 per-thread (i)NTT kernels, as described in [KJPA20], where each thread in a kernel loads eight residues into the registers at a time.…”

Section: Basic HE Operations (mentioning)

confidence: 99%

“…For the GPU implementation, we use the hierarchical NTT implementation [KJPA20], which heavily exploits shared memory in GPUs while adopting an earlier approach in [GLD+08]. Specifically, for every (i)NTT with N residues, we use 8 per-thread (i)NTT kernels, as described in [KJPA20], where each thread in a kernel loads eight residues into the registers at a time. We launch kernels each performing radix-256 or radix-512 (i)NTT, where radix-k divides an N-point transformation into k interleaved N/k-point transformations.…”

Section: Basic HE Operations (mentioning)

confidence: 99%
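The radix-k split described in the statement above can be illustrated with a small sketch. It uses a plain cyclic NTT over a toy modulus, not the negacyclic, GPU-resident transform of the cited work, and the function names and parameters (p = 17, w = 3, N = 16, k = 4) are assumptions chosen only so that the identity is easy to check.

```python
def ntt_naive(x, w, p):
    """Direct O(n^2) NTT: X[j] = sum_i x[i] * w^(i*j) mod p."""
    n = len(x)
    return [sum(x[i] * pow(w, i * j, p) for i in range(n)) % p for j in range(n)]


def ntt_radix_k(x, w, p, k):
    """One radix-k step: k interleaved (n/k)-point NTTs, then recombination.

    Writing i = r + k*t gives
        X[j] = sum_r w^(r*j) * Y_r[j mod (n/k)],
    where Y_r is the (n/k)-point NTT of x[r], x[r+k], x[r+2k], ...
    taken with the root w^k.
    """
    n = len(x)
    m = n // k
    wk = pow(w, k, p)
    # Each interleaved sub-sequence is transformed independently.
    subs = [ntt_naive(x[r::k], wk, p) for r in range(k)]
    # Recombine the k partial results with twiddle factors w^(r*j).
    return [sum(pow(w, r * j, p) * subs[r][j % m] for r in range(k)) % p
            for j in range(n)]


if __name__ == "__main__":
    p, w = 17, 3            # 3 has multiplicative order 16 modulo 17
    x = list(range(16))
    assert ntt_radix_k(x, w, p, 4) == ntt_naive(x, w, p)
```

The k sub-transforms do not depend on one another, which is the kind of independence that lets a group of threads work on its own small transform before the results are recombined.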

“…Our baseline (i)NTT implementation [KJPA20] launches two GPU kernels, each of which performs radix-√N (i)NTT, resulting in 2N log N / log √N accesses for the input and output. Rather than Barrett's algorithm [Bar87], we adopt Shoup's ModMul, as implemented in [KJPA20], which is commonly used for (i)NTT to reduce the operational complexity of ModMuls [CLP17, HS14]. Using Shoup's method adds extra N memory accesses as it demands a precomputed value for each ModMul.…”

Section: # of ModMuls (mentioning)

confidence: 99%
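To make the trade-off in the statement above concrete, here is a minimal sketch of Shoup-style modular multiplication by a fixed operand. It assumes a 64-bit word (`BETA = 64`) and a modulus below 2^63; the function names are hypothetical and the code is not the cited GPU implementation. The precomputed companion value `w_shoup` is the extra per-ModMul datum that the statement says costs additional memory accesses.

```python
BETA = 64                    # assumed machine word size in bits
MASK = (1 << BETA) - 1


def shoup_precompute(w, p):
    """Companion value for a fixed multiplicand w < p: floor(w * 2^BETA / p)."""
    return (w << BETA) // p


def shoup_modmul(x, w, w_shoup, p):
    """Return x * w mod p for 0 <= x < 2^BETA, 0 <= w < p, p < 2^(BETA-1).

    q underestimates floor(x * w / p) by at most 1, so the remainder
    candidate lies in [0, 2p) and one conditional subtraction suffices.
    """
    q = (x * w_shoup) >> BETA        # high word of a word-by-word product
    r = (x * w - q * p) & MASK       # only the low word is needed
    return r - p if r >= p else r


if __name__ == "__main__":
    p = (1 << 61) - 1                # a Mersenne prime, used as a toy modulus
    w = 1234567890123456789
    w_shoup = shoup_precompute(w, p) # stored alongside w, e.g. per twiddle factor
    x = 987654321987654321
    assert shoup_modmul(x, w, w_shoup, p) == (x * w) % p
```

As a side note on the access count quoted above, 2N log N / log √N simplifies to 4N, because log √N = (log N) / 2.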


Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs

Jung, Kim, Ahn, et al. 2021

TCHES

Self Cite


Fully homomorphic encryption (FHE) has been gaining in popularity as an emerging means of enabling an unlimited number of operations in an encrypted message without decryption. A major drawback of FHE is its high computational cost. Specifically, a bootstrapping step that refreshes the noise accumulated through consequent FHE operations on the ciphertext can even take minutes of time. This significantly limits the practical use of FHE in numerous real applications. By exploiting the massive parallelism available in FHE, we demonstrate the first instance of the implementation of a GPU for bootstrapping CKKS, one of the most promising FHE schemes supporting the arithmetic of approximate numbers. Through analyzing CKKS operations, we discover that the major performance bottleneck is their high main-memory bandwidth requirement, which is exacerbated by leveraging existing optimizations targeted to reduce the required computation. These observations motivate us to utilize memory-centric optimizations such as kernel fusion and reordering primary functions extensively. Our GPU implementation shows a 7.02× speedup for a single CKKS multiplication compared to the state-of-the-art GPU implementation and an amortized bootstrapping time of 0.423 µs per bit, which corresponds to a speedup of 257× over a single-threaded CPU implementation. By applying this to logistic regression model training, we achieved a 40.0× speedup compared to the previous 8-thread CPU implementation with the same data.


FPGA Acceleration of Number Theoretic Transform

Tian, Yang, Kuppannagari, et al. 2021

Lecture Notes in Computer Science

