Samuel Antão scite author profile

Bajard

2011

The Computer Journal

Accelerating cryptographic applications on massively parallel computing platforms, such as Graphics Processing Units (GPUs), is a real challenge for practical implementations. In this paper, we propose a parallel algorithm for Elliptic Curve (EC) point multiplication in order to compute EC cryptography on these platforms. The proposed approach relies on the Residue Number System (RNS) to extract parallelism from high-precision integer arithmetic. Results suggest a maximum throughput of 9827 EC multiplications per second and a minimum latency of 29.2 ms for a 224-bit underlying field on a commercial Nvidia 285 GTX GPU. Compared with other approaches reported in the related art, performance improves by up to an order of magnitude in latency and 122% in throughput. An experimental analysis of scalability, based on OpenCL descriptions of the proposed algorithms, suggests that the proposed RNS approach can gain further advantage over implementations on general-purpose multi-cores for GPUs and for elliptic curves supported by smaller underlying finite fields.

Data access optimization in a processing-in-memory system

Sura

Jacob

Chen

et al. 2015

The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high-performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.

Combining Residue Arithmetic to Design Efficient Cryptographic Circuits and Systems

IEEE Circuits Syst. Mag.

Martins

2016

Offloading Support for OpenMP in Clang and LLVM

Bataev

Jacob

et al. 2016

Elliptic Curve point multiplication on GPUs

Bajard

2010

Integrating GPU support for OpenMP offloading directives into Clang

Bertolli

Bercea

et al. 2015

MRC-Based RNS Reverse Converters for the Four-Moduli Sets $\{2^{n} + 1, 2^{n} - 1, 2^{n}, 2^{2n + 1} - 1\}$ and $\{2^{n} + 1, 2^{n} - 1, 2^{2n}, 2^{2n + 1} - 1\}$

IEEE Trans. Circuits Syst. II

2012