Montgomery Multiplication Using Vector Instructions

Bos, Joppe W.; Montgomery, Peter L.; Shumow, Daniel; Zaverucha, Gregory M.

doi:10.1007/978-3-662-43414-7_24

Cited by 24 publications (41 citation statements)

References 28 publications (37 reference statements)

“…In [12], Pabbuleti et al implemented the NIST-recommended prime-field curve including P192 and P224 on the Snapdragon APQ8060 within 404, 405 clock cycles via applying multiplicand reduction method into SIMD-based machine. Recently, in SAC'13, a different approach to split the Montgomery multiplication into two parts, being computed in parallel, was introduced [6]. They flip the sign of the precomputed Montgomery constant and accumulate the result in two separate intermediate values that are computed concurrently while avoiding a redundant representation.…”

Section: Previous Work (mentioning)

confidence: 99%
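
The two-accumulator idea described in this statement can be sketched in a few lines. This is a minimal illustration, not the cited authors' code: it uses Python big integers in place of SIMD lanes, and the function name `montmul_two_way` and its parameters are hypothetical. It assumes an odd modulus `m` below `2**(w*n)`, inputs `a, b < m`, and Python 3.8+ for `pow(m, -1, r)`. The constant `mu` is the positive inverse of `m` modulo the word size (the "flipped sign" relative to the usual `-m^-1`), so the result is obtained as a difference `d - e` of two independently accumulated values.

```python
def montmul_two_way(a, b, m, n, w=32):
    """Return a * b * 2^(-w*n) mod m using two separate accumulators.

    Hypothetical sketch: d accumulates the a_j * b terms, e accumulates
    the q * m terms; on a SIMD machine the two updates could run in
    parallel lanes. Requires m odd, 0 <= a, b < m < 2^(w*n).
    """
    r = 1 << w
    mask = r - 1
    mu = pow(m, -1, r)  # positive inverse of m modulo 2^w (sign flipped)
    d = 0
    e = 0
    for j in range(n):
        a_j = (a >> (w * j)) & mask          # j-th word of a
        # Choose q so that the low words of (d + a_j*b) and (e + q*m) agree.
        q = (mu * ((d + a_j * b - e) & mask)) & mask
        d = (d + a_j * b) >> w               # computation 1
        e = (e + q * m) >> w                 # computation 2
    c = d - e                                # lies strictly between -m and m
    if c < 0:
        c += m
    return c
```

Because the low words of the two accumulators match before each shift, the difference `d - e` is divided exactly by `2^w` in every round, which is why a single conditional addition of `m` at the end suffices.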

“…Firstly, we re-organized operands by conducting transpose operation, which can efficiently shuffle inner vector by 32-bit wise. Instead of a normal order ((B[0], B[1]), (B[2], B[3]), (B[4], B[5]), (B[6], B[7])), we actually classify the operand as groups ((B[0], B[4]), (B[2], B[6]), (B[1], B[5]), (B[3], B[7])) for computing multiplication where each operand ranges from 0 to 2^32 − 1 (0xffff ffff in hexadecimal form). Secondly, multiplication [7])) where the results are located from 0 to 2^64 − 2^33 + 1 (0xffff fffe 0000 0001).…”

Section: Cascade Operand Scanning Multiplication For SIMD (mentioning)

confidence: 99%
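
The regrouping and the value ranges quoted in this statement can be checked with a short sketch. The helper name `cos_grouping` and the constants are hypothetical and only mirror the index pairs given in the quote; how the transpose is realized on actual vector registers is not modeled here.

```python
MAX_WORD = (1 << 32) - 1            # largest 32-bit operand word, 0xffffffff
MAX_PRODUCT = MAX_WORD * MAX_WORD   # largest 32x32-bit partial product

# Index pairs as listed in the quoted statement (assumed to be lane pairs).
GROUP_ORDER = [(0, 4), (2, 6), (1, 5), (3, 7)]


def cos_grouping(b):
    """Regroup eight 32-bit words of b into the quoted pair order."""
    if len(b) != 8:
        raise ValueError("expected exactly eight 32-bit words")
    return [(b[i], b[j]) for i, j in GROUP_ORDER]


if __name__ == "__main__":
    print(cos_grouping([f"B[{i}]" for i in range(8)]))
    print(hex(MAX_PRODUCT))
```

The product bound follows from (2^32 − 1)^2 = 2^64 − 2^33 + 1, which matches the hexadecimal value given in the quote.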

“…In [6], Bos et al introduced a 2-way Montgomery multiplication for SIMD architecture. However, the proposed 2-way Montgomery multiplication has high data interdependency because they used ordinary operand-scanning method for multiplication and reduction procedures which compute partial products in incremental order and previous partial product results are directly used in next step.…”

Section: Coarsely Integrated Cascade Operand Scanning Multiplication (mentioning)

confidence: 99%

“…After then the higher bits are added to lower bits of upper intermediate results. For example, higher bits of (C[0], C[4]), (C[2], C[6]), (C[1], C[5]), (C[3]) are added to lower bits of (C[1], C[5]), (C[3], C[7]), (C[2], C[6]), (C[4]). These intermediate results are placed between 0 and 2^33 − 2 (0x1 ffff fffe).…”

Section: This Process Takes ((B[0] B[1]) (B[2] B[3]) (B[4] B[5]) (mentioning)

confidence: 99%
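
The bound in this statement can be illustrated with a small sketch of the folding step. It assumes each intermediate result C[i] is a 64-bit value and that the high half of C[i] is added to the low half of C[i+1], as the index pairs in the quote suggest; the function name `fold_high_into_next_low` is hypothetical.

```python
MASK32 = (1 << 32) - 1


def fold_high_into_next_low(c):
    """Add the high 32 bits of each C[i] to the low 32 bits of C[i+1].

    Returns len(c) + 1 values; entry i has weight 2^(32*i). Carries are
    not propagated here, so an entry may need up to 33 bits.
    """
    lows = [x & MASK32 for x in c]
    highs = [x >> 32 for x in c]
    out = [lows[0]]
    for i in range(1, len(c)):
        out.append(lows[i] + highs[i - 1])
    out.append(highs[-1])  # the top high half has nothing to be added to
    return out
```

With every input at its 64-bit maximum, each folded entry is at most (2^32 − 1) + (2^32 − 1) = 2^33 − 2, and the weighted sum of the entries is unchanged by the fold.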

“…On the other hand, a conventional non-redundant representation based on a radix of 2 32 reduces the number of partial products to only 6 × 6 = 36. At SAC 2013, Bos et al introduced a 2-way Montgomery multiplication for SIMD processors including ARM NEON [6]. Their implementation computes the multiplication and reduction operation simultaneously using a non-redundant representation, which allowed them to exploit the SIMD-level parallelism provided by the NEON engine.…”

Section: Introduction (mentioning)

confidence: 99%

Montgomery Modular Multiplication on ARM-NEON Revisited

Seo, Liu, Großschädl, et al. 2015

Information Security and Cryptology - ICISC 2014

Abstract. Montgomery modular multiplication constitutes the "arithmetic foundation" of modern public-key cryptography with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand Scanning (COS) method to speed up multi-precision multiplication on SIMD architectures. We developed the COS technique with the goal of reducing Read-After-Write (RAW) dependencies in the propagation of carries, which also reduces the number of pipeline stalls (i.e. bubbles). The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a "non-canonical" representation of operands with a reduced radix. We show that two COS computations can be "coarsely" integrated into an efficient vectorized variant of Montgomery multiplication, which we call the Coarsely Integrated Cascade Operand Scanning (CICOS) method. Due to our sophisticated instruction scheduling, the CICOS method reaches record-setting execution times for Montgomery modular multiplication on ARM-NEON platforms. Detailed benchmarking results obtained on ARM Cortex-A9 and Cortex-A15 processors show that the proposed CICOS method outperforms Bos et al.'s implementation from SAC 2013 by up to 57% (A9) and 40% (A15), respectively.

Efficient arithmetic on ARM‐NEON and its application for high‐speed RSA implementation

Seo, Liu, Großschädl, et al. 2016

Security Comm Networks

Advanced modern processors support single instruction, multiple data (SIMD) instructions (e.g., Intel‐AVX and ARM‐NEON), and a massive body of research has been conducted on vector‐parallel implementations of modular arithmetic, a crucial component of modern public‐key cryptography ranging from Rivest, Shamir, and Adleman (RSA), ElGamal, and the Digital Signature Algorithm (DSA) to elliptic curve cryptography. In this paper, we introduce a novel double operand scanning method to speed up multi‐precision squaring with non‐redundant representations on SIMD architectures, where part of the operands is doubled so that the squaring operation can be computed without read‐after‐write dependencies between source and destination variables. Afterwards, the Karatsuba algorithm is applied to both multiplication and squaring operations. For modular multiplication, the separated Montgomery algorithm is chosen. Finally, the resulting RSA implementations outperform the best‐known results on ARM‐NEON platforms. Copyright © 2017 John Wiley & Sons, Ltd.
