2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date54114.2022.9774759

RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

Abstract: The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms requires dedicated hardware to satisfy their latency, throughput, and precision requirements. While inference is achievable in practical cases, online fine-tuning and adaptation of general DL models are still highly challenging. One of the key stumbling blocks is the need for parallel floating-point operations, which are considered unaffordable on sub-100 mW extreme-edge SoCs. We tackle this proble…

Cited by 10 publications (3 citation statements)
References 18 publications

Citation Statements

“…2). TPEs focus on accelerating matrix multiplication of the kind D = A × B + C, exploiting an internal high-efficiency systolic structure extending RedMulE [18,33], an open-source systolic array with multi-precision Fused Multiply-Add Modules that achieves up to 920 GFLOPS/W when operating on FP8 inputs with FP16 accumulators and 775 GFLOPS/W on full FP16. ISOLDE aims at further extending the TPE capabilities in several directions: more internal and input/output formats; tight integration with the RISC-V CVA6 cores to enable TPE utilization within performance-critical software code; larger performance gains; and better integration with software.…”
Section: Hardware Accelerators
Mentioning confidence: 99%
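
The D = A × B + C operation described in this quote is a general matrix multiply-accumulate (GEMM). As a purely functional reference, here is a minimal scalar sketch in C; the _Float16 type is a GCC/Clang extension, and the FP16-accumulator loop nest only models the arithmetic of the quoted configuration, not RedMulE's actual systolic datapath:

    #include <stddef.h>

    /* Functional model of D = A * B + C with A (M x K), B (K x N),
     * C and D (M x N), all row-major FP16. Accumulating in FP16
     * mirrors the quoted "FP16 accumulators"; the loop nest is an
     * illustrative model, not the hardware structure. */
    static void gemm_fp16(size_t M, size_t N, size_t K,
                          const _Float16 *A, const _Float16 *B,
                          const _Float16 *C, _Float16 *D)
    {
        for (size_t i = 0; i < M; i++) {
            for (size_t j = 0; j < N; j++) {
                _Float16 acc = C[i * N + j];            /* bias term C */
                for (size_t k = 0; k < K; k++)
                    acc += A[i * K + k] * B[k * N + j]; /* multiply-add */
                D[i * N + j] = acc;
            }
        }
    }
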
“…Figure 2 shows that a PULP cluster can be enhanced by introducing application-specific hardware accelerators that provide higher efficiency and performance during specific kernel execution over general-purpose cores. Since intense matrix multiplications are widespread in machine learning and deep learning, we integrate a Reduced-Precision Matrix Multiplication Engine, RedMulE [40], to introduce up to 22× better performance and 5× better energy efficiency on the execution of 16-bits floating-point matrix multiplication kernels over the parallel execution on the general-purpose RISC-V cores. Furthermore, we enhanced RedMulE with fault-tolerant capabilities to tackle safety-critical satellite onboard computing.…”
Section: Redundant Hardware Accelerators
Mentioning confidence: 99%
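
The quote does not detail how the fault-tolerant RedMulE variant works; one classic pattern for safety-critical onboard computing is modular redundancy with majority voting over replicated results. The voter below is a generic, hypothetical sketch of that technique, not the paper's actual mechanism:

    #include <stdint.h>

    /* Hypothetical bitwise majority voter for triple modular
     * redundancy (TMR): each output bit is 1 iff at least two of
     * the three replicated results agree, so any single faulty
     * replica is outvoted. Generic illustration only. */
    static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }
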
“…The Tensor Product Engine (TPE) [29] accelerates matrix multiplications (MatMuls) of the kind Z = X • W . It is designed to use the IEEE 754 binary-16 representation (FP16 in the following) since it is understood that FP16 can be used to train Neural Networks without significant accuracy loss, but reducing the power consumption and time to computation [30].…”
Section: Tensor Product Engine
Mentioning confidence: 99%
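
For context on the binary16 format the quote refers to: it packs 1 sign bit, 5 exponent bits (bias 15), and 10 fraction bits into 16 bits. The decoder below is a short illustrative sketch covering normal numbers and zero only (subnormals, infinities, and NaNs omitted for brevity):

    #include <math.h>
    #include <stdint.h>

    /* Decode an IEEE 754 binary16 bit pattern into a float.
     * Layout: bit 15 = sign, bits 14:10 = exponent (bias 15),
     * bits 9:0 = fraction. Normals and zero only. */
    static float fp16_to_float(uint16_t h)
    {
        int sign = (h >> 15) & 0x1;
        int exp  = (h >> 10) & 0x1F;
        int frac =  h        & 0x3FF;

        if (exp == 0 && frac == 0)              /* signed zero */
            return sign ? -0.0f : 0.0f;

        /* value = (-1)^sign * (1 + frac/2^10) * 2^(exp - 15) */
        float mant = 1.0f + (float)frac / 1024.0f;
        float val  = ldexpf(mant, exp - 15);
        return sign ? -val : val;
    }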