“…In each kernel cycle of Montgomery modular multiplication with [19], it shifts operands
and
1 bit left
times with temporary results
, and then right-shifts the result
bits 1 time at the end of the kernel cycle with
[20], that is, performing the division of
in
from a serial operation to a one‐step operation. In this way, the latency between processing elements is decreased from 2 clock cycles to 1 clock cycle.…”
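
The idea in the excerpt can be illustrated with a minimal software sketch. The operand symbols are missing from the excerpt, so everything named here is an assumption: `a` and `b` are the multiplicands, `n` is an odd modulus, `k` is the bit length, and both function names are hypothetical. The first function is the textbook bit-serial radix-2 Montgomery loop, which halves the running sum in every iteration. The second is one possible reading of the deferred scheme: the operands are shifted left by 1 bit per iteration and the division by 2^k is done as a single right shift at the end. It is not claimed to be the cited design.

```python
def montgomery_serial(a: int, b: int, n: int, k: int) -> int:
    """Textbook radix-2 Montgomery product a*b*2^(-k) mod n.

    Assumes n is odd, a < 2**k, and a, b < n. The division by 2
    happens serially, once per loop iteration.
    """
    s = 0
    for i in range(k):
        if (a >> i) & 1:      # add b when bit i of a is set
            s += b
        if s & 1:             # make the sum even by adding the odd modulus
            s += n
        s >>= 1               # serial division: one bit per iteration
    return s - n if s >= n else s


def montgomery_deferred(a: int, b: int, n: int, k: int) -> int:
    """Same product, with the division deferred to one final shift.

    Assumed interpretation of the excerpt: b and n are shifted left by
    1 bit in each of the k iterations, and the accumulated sum is
    right-shifted by k bits once at the end.
    """
    s = 0
    b_shifted, n_shifted = b, n
    for i in range(k):
        if (a >> i) & 1:      # add b * 2**i when bit i of a is set
            s += b_shifted
        if (s >> i) & 1:      # clear bit i of the sum using n * 2**i
            s += n_shifted
        b_shifted <<= 1       # shift operands left by 1 bit
        n_shifted <<= 1
    s >>= k                   # one-step division by 2**k
    return s - n if s >= n else s
```

In this sketch the deferred variant trades a wider accumulator (roughly twice as many bits) for the removal of the per-iteration shift. Whether that trade accounts for the latency reduction reported in the excerpt cannot be confirmed from the excerpt alone.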