Modular multiplication is a core operation in all public-key cryptosystems. The performance of these cryptosystems can be enhanced substantially by incorporating an optimized modular multiplier. This paper presents serial and parallel radix-4 modular multipliers based on the interleaved multiplication algorithm and the Montgomery power laddering technique. The serial radix-4 interleaved modular multiplier provides a 50% reduction in the required clock cycles. In addition to this reduction in clock cycles, the parallel modular multiplier maintains a critical path delay comparable to that of bit-serial interleaved multipliers. The proposed designs are implemented in Verilog HDL and synthesized for a Virtex-6 FPGA platform using the Xilinx ISE 14.2 Design Suite. The serial radix-4 multiplier computes a 256-bit modular multiplication in 1.3 µs, occupies 3.9K LUTs, and runs at 96 MHz. The parallel radix-4 multiplier takes 0.77 µs, occupies 5.3K LUTs, and runs at 166 MHz. The results show that the parallel radix-4 modular multiplier provides 62% and 49% speed-up over the corresponding bit-serial and bit-parallel versions, respectively. These designs are therefore suitable for accelerating modular multiplication in many cryptographic processors.
Index Terms: Finite field, elliptic curve cryptography (ECC), interleaved multiplication, public key cryptography (PKC).
I. INTRODUCTION
Modular multiplication is a computationally intensive operation that is used extensively in a variety of public key cryptographic schemes such as RSA and elliptic curve cryptography (ECC) [1], [2]. Elliptic curve based cryptographic schemes enjoy much smaller key sizes than RSA, which leads to better bandwidth utilization, lower storage requirements, and lower power consumption. To achieve the 128-bit Advanced Encryption Standard (AES) security level, finite field operations in ECC are performed on operands of around 256 bits. Because of this computational complexity, a dedicated hardware implementation is essential to meet the timing constraints of many real-time applications.
Several designs have been presented to speed up modular multiplication. These designs can be classified into three categories: designs based on NIST recommended primes [3], designs based on the interleaved multiplication algorithm [4], and designs based on the Montgomery multiplication algorithm [5]. A pipelined modular multiplier design reported in [6] supports five NIST recommended primes. Its datapath comprises 8 pipeline stages, with a latency of 80 ns for primes of size 192, 224, and 256 bits and 200 ns for primes of size 384 and 521 bits. It consumes 8340 slices and 259 dedicated DSP blocks on a Virtex-6 FPGA platform, which may not fit into smaller FPGAs but is suitable for high-speed applications. The designs reported in [7], [8] also exploit the special structure of NIST primes. These implementations are dedicated to 224-bit and 256-bit primes and do not offer the flexibility that may be desirable in many applications.
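To make the second design category concrete, the following is a minimal software sketch of interleaved modular multiplication in its bit-serial (radix-2) form and in a radix-4 form that consumes two multiplier bits per iteration. It is only an illustration of the general technique, not the hardware architecture proposed in this paper; the function names, the returned iteration counter, and the use of Python's `%` operator for the radix-4 reduction step are assumptions made for brevity (a hardware datapath would use comparators and subtractors instead of a general division).

```python
def interleaved_modmul_radix2(x, y, p, n):
    """Bit-serial interleaved multiplication (illustrative sketch).

    Returns (x * y mod p, number of iterations) for n-bit operands.
    """
    assert 0 <= x < p and 0 <= y < p and p.bit_length() <= n
    z = 0
    iterations = 0
    for i in reversed(range(n)):      # scan the bits of y, MSB first
        z = 2 * z                     # shift the partial result
        if (y >> i) & 1:
            z += x                    # add the multiplicand when the bit is set
        # Here z < 3p, so at most two subtractions bring it below p.
        if z >= p:
            z -= p
        if z >= p:
            z -= p
        iterations += 1
    return z, iterations


def interleaved_modmul_radix4(x, y, p, n):
    """Radix-4 variant: two bits of y per iteration (illustrative sketch)."""
    assert 0 <= x < p and 0 <= y < p and p.bit_length() <= n and n % 2 == 0
    z = 0
    iterations = 0
    for i in reversed(range(0, n, 2)):
        digit = (y >> i) & 3          # next radix-4 digit of y, in {0, 1, 2, 3}
        z = 4 * z + digit * x         # here z < 7p
        z %= p                        # assumption: stands in for the hardware reduction
        iterations += 1
    return z, iterations


if __name__ == "__main__":
    # NIST P-256 prime, used here only as a convenient 256-bit modulus.
    p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
    a = 0x1234567890ABCDEF << 128 | 0xFEDCBA0987654321
    b = p256 - 12345
    r2, it2 = interleaved_modmul_radix2(a, b, p256, 256)
    r4, it4 = interleaved_modmul_radix4(a, b, p256, 256)
    print(r2 == r4 == (a * b) % p256, it2, it4)
```

For 256-bit operands the radix-2 loop runs 256 times and the radix-4 loop 128 times, which is the 50% reduction in iterations referred to in the abstract. If one further assumes one iteration per clock cycle (an assumption not stated in the text), 128 iterations at 96 MHz take about 1.33 µs and at 166 MHz about 0.77 µs, in line with the reported latencies.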