2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date54114.2022.9774759

RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

Abstract: The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms requires dedicated hardware to satisfy their latency, throughput, and precision requirements. While inference is achievable in practical cases, online fine-tuning and adaptation of general DL models are still highly challenging. One of the key stumbling blocks is the need for parallel floating-point operations, which are considered unaffordable on sub-100 mW extreme-edge SoCs. We tackle this proble…

Cited by 10 publications (3 citation statements)
References 18 publications

Citation Statements

“…2). TPEs focus on accelerating matrix multiplication of the kind D = A × B + C, exploiting an internal high-efficiency systolic structure extending RedMulE [18,33], an open-source systolic array with multi-precision Fused Multiply-Add Modules that achieves up to 920 GFLOPS/W when operating on FP8 inputs with FP16 accumulators and 775 GFLOPS/W on full FP16. ISOLDE aims at further extending the TPE capabilities in several directions: more internal and input/output formats; tight integration with the RISC-V CVA6 cores to enable TPE utilization within performance-critical software code; larger performance gains; and better integration with software.…”
Section: Hardware Accelerators
Mentioning confidence: 99%
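
The D = A × B + C operation described in this quote is a general matrix multiply-accumulate (GEMM). As a purely functional reference, here is a minimal scalar sketch in C; the _Float16 type is a GCC/Clang extension, and the FP16-accumulator loop nest only models the arithmetic of the quoted configuration, not RedMulE's actual systolic datapath:

    #include <stddef.h>

    /* Functional model of D = A * B + C with A (M x K), B (K x N),
     * C and D (M x N), all row-major FP16. Accumulating in FP16
     * mirrors the quoted "FP16 accumulators"; the loop nest is an
     * illustrative model, not the hardware structure. */
    static void gemm_fp16(size_t M, size_t N, size_t K,
                          const _Float16 *A, const _Float16 *B,
                          const _Float16 *C, _Float16 *D)
    {
        for (size_t i = 0; i < M; i++) {
            for (size_t j = 0; j < N; j++) {
                _Float16 acc = C[i * N + j];            /* bias term C */
                for (size_t k = 0; k < K; k++)
                    acc += A[i * K + k] * B[k * N + j]; /* multiply-add */
                D[i * N + j] = acc;
            }
        }
    }
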
“…Figure 2 shows that a PULP cluster can be enhanced by introducing application-specific hardware accelerators that provide higher efficiency and performance during specific kernel execution over general-purpose cores. Since intense matrix multiplications are widespread in machine learning and deep learning, we integrate a Reduced-Precision Matrix Multiplication Engine, RedMulE [40], to introduce up to 22× better performance and 5× better energy efficiency on the execution of 16-bits floating-point matrix multiplication kernels over the parallel execution on the general-purpose RISC-V cores. Furthermore, we enhanced RedMulE with fault-tolerant capabilities to tackle safety-critical satellite onboard computing.…”
Section: Redundant Hardware Accelerators
Mentioning confidence: 99%
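
The quote does not detail how the fault-tolerant RedMulE variant works; one classic pattern for safety-critical onboard computing is modular redundancy with majority voting over replicated results. The voter below is a generic, hypothetical sketch of that technique, not the paper's actual mechanism:

    #include <stdint.h>

    /* Hypothetical bitwise majority voter for triple modular
     * redundancy (TMR): each output bit is 1 iff at least two of
     * the three replicated results agree, so any single faulty
     * replica is outvoted. Generic illustration only. */
    static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }
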
“…The Tensor Product Engine (TPE) [29] accelerates matrix multiplications (MatMuls) of the kind Z = X • W . It is designed to use the IEEE 754 binary-16 representation (FP16 in the following) since it is understood that FP16 can be used to train Neural Networks without significant accuracy loss, but reducing the power consumption and time to computation [30].…”
Section: Tensor Product Engine
Mentioning confidence: 99%
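
For context on the binary16 format the quote refers to: it packs 1 sign bit, 5 exponent bits (bias 15), and 10 fraction bits into 16 bits. The decoder below is a short illustrative sketch covering normal numbers and zero only (subnormals, infinities, and NaNs omitted for brevity):

    #include <math.h>
    #include <stdint.h>

    /* Decode an IEEE 754 binary16 bit pattern into a float.
     * Layout: bit 15 = sign, bits 14:10 = exponent (bias 15),
     * bits 9:0 = fraction. Normals and zero only. */
    static float fp16_to_float(uint16_t h)
    {
        int sign = (h >> 15) & 0x1;
        int exp  = (h >> 10) & 0x1F;
        int frac =  h        & 0x3FF;

        if (exp == 0 && frac == 0)              /* signed zero */
            return sign ? -0.0f : 0.0f;

        /* value = (-1)^sign * (1 + frac/2^10) * 2^(exp - 15) */
        float mant = 1.0f + (float)frac / 1024.0f;
        float val  = ldexpf(mant, exp - 15);
        return sign ? -val : val;
    }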