Comparison of parallelized radix-2 and radix-4 scalable Montgomery multipliers

Carter, Andrew; Ning, Paula; Koven, William; Harris, David; Braly, Michael; Jones, Nathan; Massas, Julien; Murakami, Trevin; Simoni, Alexandra; Mathew, Sanu

doi:10.1109/acssc.2013.6810473

Cited by 1 publication

(1 citation statement)

References 7 publications

“…In each kernel cycle of Montgomery modular multiplication with [19], it shifts operands $M$ and $A$ 1 bit left $n/w$ times with temporary results $Z := Z + q_i \cdot M + b_i \cdot A$, and then right shift the result $n/w$ bits 1 time at the end of the kernel cycle with $Z := Z/2^{n/w}$ [20], that is, performing the division of $2^{n/w}$, $(Z + q_i \cdot M + b_i \cdot A)/2^{n/w}$, from serial operation to one‐step operation. In this way, the latency between processing elements is decreased from 2 clock cycles to 1 clock cycle.…”

Section: Hardware Implementation (mentioning)

confidence: 99%
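The quoted statement contrasts a serial division (one right shift per step) with a single division by $2^{n/w}$ at the end of the kernel cycle. The following is a minimal Python sketch of that idea under one reading of the quote, not the cited hardware design: `k` stands in for $n/w$, and the function names, operand values and the bit-serial choice of `q_i` are assumptions made for illustration.

```python
def bit(x: int, i: int) -> int:
    """Return bit i of the non-negative integer x."""
    return (x >> i) & 1


def montgomery_serial(a: int, b: int, m: int, k: int) -> int:
    """Textbook radix-2 loop: one right shift (division by 2) per step."""
    z = 0
    for i in range(k):
        b_i = bit(b, i)
        q_i = (z + b_i * a) & 1            # q_i makes the sum even (m is odd)
        z = (z + q_i * m + b_i * a) >> 1   # serial division by 2
    return z


def montgomery_deferred(a: int, b: int, m: int, k: int) -> int:
    """Shift M and A left each step; divide by 2^k once at the end."""
    z = 0
    m_shifted, a_shifted = m, a
    for i in range(k):
        b_i = bit(b, i)
        # The low i bits of z are already zero, so q_i is chosen to clear bit i.
        q_i = bit(z + b_i * a_shifted, i)
        z = z + q_i * m_shifted + b_i * a_shifted
        m_shifted <<= 1
        a_shifted <<= 1
    return z >> k                           # one-step division by 2^k


# Hypothetical small operands; M must be odd.
M, K = 239, 8
A, B = 100, 200
serial = montgomery_serial(A, B, M, K)
deferred = montgomery_deferred(A, B, M, K)
```

Both functions return the same value, congruent to $A \cdot B \cdot 2^{-k} \bmod M$. The deferred form trades a wider accumulator for the removal of the per-step shift.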

Radix‐16 CSA‐based low‐latency non‐Montgomery modular multiplier

Wu¹

2021

The Journal of Engineering

The long-precision modular multiplier is usually performed by Montgomery algorithm with bit shifts, yielding a result in the form of $A \cdot B \cdot 2^{-n} \bmod M$. In fact, with the precomputation of $\delta \cdot 2^{n} \bmod M$ proposed by others, it is able to compute classic modular multiplication $A \cdot B \bmod M$ rather than Montgomery modular multiplication $A \cdot B \cdot N^{-1} \bmod M$. In this work, a multi-bit CSA-based long-precision modular multiplier is proposed for hardware implementation. It includes two carry save adders and one RAM access to perform modular multiplications. After every modular multiplication, a division is applied to reduce the result down to $[0, M)$. Hardware implementation results show that a 1024-bit modular multiplication can be completed in 2.39 μs on the XC5V FPGA platform, costing 5793 slices, 8 DSPs and 36 BRAMs, which is a promising candidate to compute incontinuous classic modular multiplications.
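The gap between a Montgomery product and a classic product can be illustrated with a short Python sketch. It uses the textbook correction by a precomputed constant $2^{2n} \bmod M$, which is an assumption chosen for illustration and is not the $\delta \cdot 2^{n} \bmod M$ precomputation the abstract refers to. The function name and operand values are hypothetical, and the modular inverse via `pow` needs Python 3.8 or later.

```python
def montgomery_product(a: int, b: int, m: int, n: int) -> int:
    """Reference model of a Montgomery product: a * b * 2^(-n) mod m (m odd)."""
    r_inv = pow(2, -n, m)          # modular inverse of 2^n, Python 3.8+
    return (a * b * r_inv) % m


# Hypothetical small operands; M must be odd.
M, N = 239, 8
A, B = 100, 200

R2 = pow(2, 2 * N, M)              # precomputed once per modulus: 2^(2n) mod M

mont = montgomery_product(A, B, M, N)         # carries the extra 2^(-n) factor
classic = montgomery_product(mont, R2, M, N)  # second pass cancels that factor
```

The second pass multiplies by $2^{2n} \cdot 2^{-n} = 2^{n}$, so `classic` equals `(A * B) % M`. This costs an extra multiplication per result, which is the overhead a direct classic multiplier avoids.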
