Samuel Antão scite author profile

Active Memory Cube: A processing-in-memory architecture for exascale systems

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., one capable of executing 10^18 floating-point operations per second) that consumes no more than 20 MW of power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.
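
The exascale target quoted above implies a tight energy budget per operation. The following is a minimal sketch of that arithmetic, using only the two figures from the abstract (10^18 operations per second and 20 MW); the function and constant names are hypothetical and chosen for illustration.

```python
# Figures quoted in the abstract above.
TARGET_FLOPS = 10**18            # floating-point operations per second
POWER_BUDGET_W = 20 * 10**6      # 20 MW expressed in watts


def flops_per_watt(flops: int, watts: int) -> float:
    """Sustained operations per second delivered per watt of system power."""
    return flops / watts


def picojoules_per_flop(flops: int, watts: int) -> float:
    """Average energy available per operation, in picojoules."""
    # 1 W = 1 J/s, so watts / flops is joules per operation; scale to pJ.
    return watts * 10**12 / flops


if __name__ == "__main__":
    print(flops_per_watt(TARGET_FLOPS, POWER_BUDGET_W))
    print(picojoules_per_flop(TARGET_FLOPS, POWER_BUDGET_W))
```

Under these figures the whole system, including all data movement, has an average of 20 pJ available per floating-point operation, which is the motivation the abstract gives for computing inside the memory module.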

RNS-Based Elliptic Curve Point Multiplication for Massive Parallel Architectures

Antão

Bajard

Sousa

2011

The Computer Journal

Accelerating cryptographic applications on massively parallel computing platforms, such as Graphics Processing Units (GPUs), is a real challenge for practical implementations. In this paper, we propose a parallel algorithm for Elliptic Curve (EC) point multiplication in order to compute EC cryptography on these platforms. The proposed approach relies on the Residue Number System (RNS) to extract parallelism from high-precision integer arithmetic. Results suggest a maximum throughput of 9827 EC multiplications per second and a minimum latency of 29.2 ms for a 224-bit underlying field, on a commercial Nvidia 285 GTX GPU. Compared with other approaches reported in the related art, latency improves by up to an order of magnitude and throughput by up to 122%. An experimental analysis of the scalability, based on OpenCL descriptions of the proposed algorithms, suggests that, compared with implementations on general-purpose multi-cores, further advantage can be obtained from the proposed RNS approach for GPUs and for EC curves supported by underlying finite fields of smaller size.
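
To illustrate why RNS exposes parallelism in high-precision integer arithmetic, here is a minimal sketch of generic RNS addition and multiplication with reconstruction through the Chinese Remainder Theorem. It is not the algorithm from the paper: the moduli and function names are hypothetical, and the modular reduction over the EC field that a real implementation needs is omitted.

```python
from math import prod

# Hypothetical moduli chosen for illustration; they only need to be
# pairwise coprime. Their product M is the dynamic range of the system.
MODULI = (251, 253, 255, 256)
M = prod(MODULI)


def to_rns(x: int) -> tuple:
    """Represent x by its residues, one small channel per modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a: tuple, b: tuple) -> tuple:
    """Channel-wise addition; no carries cross between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a: tuple, b: tuple) -> tuple:
    """Channel-wise multiplication; each channel could run on its own thread."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues: tuple) -> int:
    """Rebuild the integer in [0, M) with the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m (Python 3.8+).
        total += r * Mi * pow(Mi, -1, m)
    return total % M


if __name__ == "__main__":
    a, b = 12345, 54321
    # Correct as long as the true result stays below M.
    print(from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b)
```

Each residue channel works on small operands and never exchanges carries with the others, which is the property that maps naturally onto many GPU threads. The result is only correct while the true value stays below the product of the moduli, so a wider field needs more or larger moduli.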

Data access optimization in a processing-in-memory system

Sura

Jacob

Chen

et al. 2015

The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high-performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.

Combining Residue Arithmetic to Design Efficient Cryptographic Circuits and Systems

Sousa

Antão

Martins

2016

IEEE Circuits Syst. Mag.

Offloading Support for OpenMP in Clang and LLVM

Antão

Bataev

Jacob

et al. 2016

Integrating GPU support for OpenMP offloading directives into Clang

Bertolli

Antão

Bercea

et al. 2015

Elliptic Curve point multiplication on GPUs

Antão

Bajard

Sousa

2010

Coordinating GPU Threads for OpenMP 4.0 in LLVM

Bertolli

Antão

Eichenberger

et al. 2014

GPU devices are becoming critical building blocks of high-performance platforms for performance and energy-efficiency reasons. As a consequence, parallel programming environments such as OpenMP were extended to support offloading code to such devices. OpenMP compilers are faced with offering an efficient implementation of device-targeting constructs. One main issue in implementing OpenMP on a GPU is efficiently supporting both sequential and parallel regions, as GPUs are optimized only to execute highly parallel workloads. Multiple solutions to this issue were proposed in previous research. In this paper, we propose a method to coordinate threads in an NVIDIA GPU that is both efficient and easily integrated as part of a compiler. To support our claims, we developed CUDA programs that mimic multiple coordination schemes and we compare their performance. We show that a scheme based on dynamic parallelism performs poorly compared to inspector-executor schemes that we introduce in this paper. We also discuss how to integrate these schemes into the LLVM compiler infrastructure.

LLVM Compiler Infrastructure in HPC
