Sujoy Sinha Roy scite author profile

In this paper, we introduce Saber, a package of cryptographic primitives whose security relies on the hardness of the Module Learning With Rounding problem (Mod-LWR). We first describe a secure Diffie-Hellman type key exchange protocol, which is then transformed into an IND-CPA encryption scheme and finally into an IND-CCA secure key encapsulation mechanism using a post-quantum version of the Fujisaki-Okamoto transform. The design goals of this package were simplicity, efficiency and flexibility resulting in the following choices: all integer moduli are powers of 2 avoiding modular reduction and rejection sampling entirely; the use of LWR halves the amount of randomness required compared to LWE-based schemes and reduces bandwidth; the module structure provides flexibility by reusing one core component for multiple security levels. A constant-time AVX2 optimized software implementation of the KEM with parameters providing more than 128 bits of post-quantum security, requires only 101K, 125K and 129K cycles for key generation, encapsulation and decapsulation respectively on a Dell laptop with an Intel i7-Haswell processor.

show abstract

Compact Ring-LWE Cryptoprocessor

Roy

Vercauteren

Mentens

et al. 2014

162

View full text Add to dashboard Cite

In this paper we propose an efficient and compact processor for a ring-LWE based encryption scheme. We present three optimizations for the Number Theoretic Transform (NTT) used for polynomial multiplication: we avoid preprocessing in the negative wrapped convolution by merging it with the main algorithm, we reduce the fixed computation cost of the twiddle factors and propose an advanced memory access scheme. These optimization techniques reduce both the cycle and memory requirements. Finally, we also propose an optimization of the ring-LWE encryption system that reduces the number of NTT operations from five to four resulting in a 20% speed-up. We use these computational optimizations along with several architectural optimizations to design an instruction-set ring-LWE cryptoprocessor. For dimension 256, our processor performs encryption/decryption operations in 20/9 µs on a Virtex 6 FPGA and only requires 1349 LUTs, 860 FFs, 1 DSP-MULT and 2 BRAMs. Similarly for dimension 512, the processor takes 48/21 µs for performing encryption/decryption operations and only requires 1536 LUTs, 953 FFs, 1 DSP-MULT and 3 BRAMs. Our processors are therefore more than three times smaller than the current state of the art hardware implementations, whilst running somewhat faster.

show abstract

FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data

Roy

Turan

Järvinen

et al. 2019

View full text Add to dashboard Cite

Homomorphic encryption is a tool that enables computation on encrypted data and thus has applications in privacy-preserving cloud computing. Though conceptually amazing, implementation of homomorphic encryption is very challenging and typically software implementations on general purpose computers are extremely slow. In this paper we present our domain specific architecture in a heterogeneous Arm+FPGA platform to accelerate homomorphic computing on encrypted data. We design a custom co-processor for the computationally expensive operations of the well-known Fan-Vercauteren (FV) homomorphic encryption scheme on the FPGA, and make the Arm processor a server for executing different homomorphic applications in the cloud, using this FPGA-based co-processor. We use the most recent arithmetic and algorithmic optimization techniques and perform design-space exploration on different levels of the implementation hierarchy. In particular we apply circuit-level and block-level pipeline strategies to boost the clock frequency and increase the throughput respectively. To reduce computation latency, we use parallel processing at all levels. Starting from the highly optimized building blocks, we gradually build our multicore multi-processor architecture for computing. We implemented and tested our optimized domain specific programmable architecture on Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. At 200 MHz FPGAclock, our implementation achieves over 13x speedup with respect to a highly optimized software implementation of the FV homomorphic encryption scheme on an Intel i5 processor running at 1.8 GHz.

show abstract

A Masked Ring-LWE Implementation

Reparaz

Roy

Vercauteren

et al. 2015

View full text Add to dashboard Cite

Lattice-based cryptography has been proposed as a postquantum public-key cryptosystem. In this paper, we present a masked ring-LWE decryption implementation resistant to first-order side-channel attacks. Our solution has the peculiarity that the entire computation is performed in the masked domain. This is achieved thanks to a new, bespoke masked decoder implementation. The output of the ring-LWE decryption are Boolean shares suitable for derivation of a symmetric key. We have implemented a hardware architecture of the masked ring-LWE processor on a Virtex-II FPGA, and have performed side channel analysis to confirm the soundness of our approach. The area of the protected architecture is around 2000 LUTs, a 20% increase with respect to the unprotected architecture. The protected implementation takes 7478 cycles to compute, which is only a factor ×2.6 larger than the unprotected implementation.

show abstract

High Precision Discrete Gaussian Sampling on FPGAs

Roy

Vercauteren

Verbauwhede

2014

View full text Add to dashboard Cite

Abstract. Lattice-based public key cryptography often requires sampling from discrete Gaussian distributions. In this paper we present an efficient hardware implementation of a discrete Gaussian sampler with high precision and large tail-bound based on the Knuth-Yao algorithm. The Knuth-Yao algorithm is chosen since it requires a minimal number of random bits and is well suited for high precision sampling. We propose a novel implementation of this algorithm based on an efficient traversal of the discrete distribution generating (DDG) tree. Furthermore, we propose optimization techniques to store the probabilities of the sample points in near-optimal space. Our implementation targets the Gaussian distribution parameters typically used in LWE encryption schemes and has maximum statistical distance of 2 −90 to a true discrete Gaussian distribution. For these parameters, our implementation on the Xilinx Virtex V platform results in a sampler architecture that only consumes 47 slices and has a delay of 3 ns.

show abstract

Efficient Ring-LWE Encryption on 8-Bit AVR Processors

Liu

Seo

Roy

et al. 2015

View full text Add to dashboard Cite

Abstract. Public-key cryptography based on the "ring-variant" of the Learning with Errors (ring-LWE) problem is both efficient and believed to remain secure in a post-quantum world. In this paper, we introduce a carefully-optimized implementation of a ring-LWE encryption scheme for 8-bit AVR processors like the ATxmega128. Our research contributions include several optimizations for the Number Theoretic Transform (NTT) used for polynomial multiplication. More concretely, we describe the Move-and-Add (MA) and the Shift-Add-Multiply-Subtract-Subtract (SAMS2) technique to speed up the performance-critical multiplication and modular reduction of coefficients, respectively. We take advantage of incompletely-reduced intermediate results to minimize the total number of reduction operations and use a special coefficient-storage method to decrease the RAM footprint of NTT multiplications. In addition, we propose a byte-wise scanning strategy to improve the performance of a discrete Gaussian sampler based on the Knuth-Yao random walk algorithm. For medium-term security, our ring-LWE implementation needs 590 k, 672 k, and 276 k clock cycles for key-generation, encryption, and decryption, respectively. On the other hand, for long-term security, the execution time of key-generation, encryption, and decryption amount to 2.2 M, 2.6 M, and 686 k cycles, respectively. These results set new speed records for ring-LWE encryption on an 8-bit processor and outperform related RSA and ECC implementations by an order of magnitude.

show abstract

Theoretical Modeling of Elliptic Curve Scalar Multiplier on LUT-Based FPGAs for Area and Speed

Roy

Rebeiro

Mukhopadhyay

2013

IEEE Trans. VLSI Syst.

View full text Add to dashboard Cite

HEAWS: An Accelerator for Homomorphic Encryption on the Amazon AWS FPGA

Turan

Roy

Verbauwhede

2020

IEEE Trans. Comput.

View full text Add to dashboard Cite

Homomorphic Encryption makes privacy preserving computing possible in a third party owned cloud by enabling computation on the encrypted data of users. However, software implementations of homomorphic encryption are very slow on general purpose processors. With the emergence of 'FPGAs as a service', hardware-acceleration of computationally heavy workloads in the cloud are getting popular. In this paper we propose HEAWS, a domain-specific coprocessor architecture for accelerating homomorphic function evaluation on the encrypted data using high-performance FPGAs available in the Amazon AWS cloud. To the best of our knowledge, we are the first to report hardware acceleration of homomorphic encryption using Amazon AWS FPGAs. Utilizing the massive size of the AWS FPGAs, we design a high-performance and parallel coprocessor architecture for the FV homomorphic encryption scheme which has become popular for computing exact arithmetic on the encrypted data. We design parallel building blocks and apply pipeline processing at different levels of the implementation hierarchy, and on top of such optimizations we instantiate multiple parallel coprocessors in the FPGA to execute several homomorphic computations simultaneously. While the absolute computation time can be reduced by deploying more computational resources, efficiency of the HW/SW communication interface plays an important role in homomorphic encryption as it is computation as well as data intensive. Our implementation utilizes state of the art 512-bit XDMA feature of high bandwidth communication available in the AWS Shell to reduce the overhead of HW/SW data transfer. Moreover, we explore the design-space to identify optimal off-chip data transfer strategy for feeding the parallel coprocessors in a time-shared manner. As a result of these optimizations, our AWS-based accelerator can perform 613 homomorphic multiplications per second for a parameter set that enables homomorphic computations of depth 4. Finally, we benchmark an artificial neural network for privacy-preserving forecasting of energy consumption in a Smart Grid application and observe five times speed up.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Sujoy Sinha Roy

Saber: Module-LWR Based Key Exchange, CPA-Secure Encryption and CCA-Secure KEM

Compact Ring-LWE Cryptoprocessor

FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data

A Masked Ring-LWE Implementation

High Precision Discrete Gaussian Sampling on FPGAs

Efficient Ring-LWE Encryption on 8-Bit AVR Processors

Theoretical Modeling of Elliptic Curve Scalar Multiplier on LUT-Based FPGAs for Area and Speed

HEAWS: An Accelerator for Homomorphic Encryption on the Amazon AWS FPGA

Contact Info

Product

Resources

About