Fabian Schuiki scite author profile

In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GLOBALFOUNDRIES 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 × 256 double precision matrix multiplication on sixteen lanes. Ara runs at 1.2 GHz in the typical corner (TT/0.80 V/25 • C), achieving a performance up to 34 DP−GFLOPS. In terms of energy efficiency, Ara achieves up to 67 DP−GFLOPS/W under the same conditions, which is 56% higher than similar vector processors found in literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.

show abstract

A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

Schuiki

Schaffner

Gürkaynak

et al. 2019

IEEE Trans. Comput.

View full text Add to dashboard Cite

Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7× over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7× energy efficiency improvement of NTX over contemporary GPUs at 4.4× less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95% parallel and energy efficiency, while providing 2.1× energy savings or 3.1× performance improvement over a GPU-based system.

show abstract

Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Zaruba¹,

Schuiki²,

Hoefler

et al. 2021

IEEE Trans. Comput.

View full text Add to dashboard Cite

Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing

2021

View full text Add to dashboard Cite

FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Mach

Schuiki

Zaruba

et al. 2021

IEEE Trans. VLSI Syst.

View full text Add to dashboard Cite

Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores

Schuiki¹,

Zaruba²,

Hoefler

et al. 2021

IEEE Trans. Comput.

View full text Add to dashboard Cite

Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/storse necessary to feed their ALU/FPU. Each instruction spent on moving data is a cycle not spent on computation, limiting ALU/FPU utilization to 33% on reductions. We propose "Stream Semantic Registers" to boost utilization and increase energy efficiency. SSR is a lightweight, non-invasive RISC-V ISA extension which implicitly encodes memory accesses as register reads/writes, eliminating a large number of loads/stores. We implement the proposed extension in the RTL of an existing multi-core cluster and synthesize the design for a modern 22nm technology. Our extension provides a significant, 2x to 5x, architectural speedup across different kernels at a small 11% increase in core area. Sequential code runs 3x faster on a single core, and 3x fewer cores are needed in a cluster to achieve the same performance. The utilization increase to almost 100% in leads to a 2x energy efficiency improvement in a multi-core cluster. The extension reduces instruction fetches by up to 3.5x and instruction cache power consumption by up to 5.6x. Compilers can automatically map loop nests to SSRs, making the changes transparent to the programmer.

show abstract

A 0.80pJ/flop, 1.24Tflop/sW 8-to-64 bit Transprecision Floating-Point Unit for a 64 bit RISC-V Processor in 22nm FD-SOI

Mach¹,

Schuiki²,

Zaruba³

et al. 2019

View full text Add to dashboard Cite

The crisis of Moore's law and new dominant Machine Learning workloads require a paradigm shift towards finely tunable-precision (a.k.a. transprecision) computing. More specifically, we need floating-point circuits that are capable to operate on many formats with high flexibility. We present the first silicon implementation of a 64-bit transprecision floatingpoint unit. It fully supports the standard double, single, and half precision, alongside custom bfloat and 8 bit formats. Operations occur on scalars or 2, 4, or 8-way SIMD vectors. We have integrated the 247 kGE unit into a 64 bit application-class RISC-V processor core, where the added transprecision support accounts for an energy and area overhead of merely 11% and 9%, respectively; yet achieving speedups and per-datum energy gains of 7.3x and 7.94x. We implemented the design in a 22 nm FD-SOI technology. The unit achieves energy efficiencies between 75 Gflop/sW and 1.24 Tflop/sW, and a performance between 1.85 Gflop/s and 14.83 Gflop/s, across formats.

show abstract

LLHD: a multi-level intermediate representation for hardware description languages

Schuiki

Kurth

Grosser

et al. 2020

View full text Add to dashboard Cite

12 3 4

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

334 Leonard St

Brooklyn, NY 11211

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Fabian Schuiki

Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor With Multiprecision Floating-Point Support in 22-nm FD-SOI

A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing

FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores

A 0.80pJ/flop, 1.24Tflop/sW 8-to-64 bit Transprecision Floating-Point Unit for a 64 bit RISC-V Processor in 22nm FD-SOI

LLHD: a multi-level intermediate representation for hardware description languages

Contact Info

Product

Resources

About