In the Internet-of-Things (IoT) domain, microcontrollers (MCUs) are used to collect and process data coming from sensors and transmit them to the cloud. Applications that require the range and precision of floating-point (FP) arithmetic can be implemented using efficient hardware floating-point units (FPUs) or by using software emulation. FPUs optimize performance and code size, whilst software emulation minimizes the hardware cost. We present a new area-optimized, IEEE 754-compliant RISC-V FPU (Tiny-FPU), and we explore the area, code size, performance, power, and energy efficiency of three different implementations of the RISC-V Instruction Set Architecture double- and single-precision FP extensions on an MCU-class processor. We show that Tiny-FPU, in its double- and single-precision versions, is respectively 54% and 37% smaller than a double- and single-precision FPU optimized for performance and energy efficiency. When coupling a RISC-V core with Tiny-FPU, we achieve up to 18.5× and 15.5× speedups with respect to the same core emulating FP operations via software.
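As a rough illustration of the data format that both an FPU datapath and a software emulation routine must handle, the minimal C sketch below (not taken from Tiny-FPU; the function name is our own) unpacks the sign, exponent, and mantissa fields of an IEEE 754 binary32 value.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: unpack the fields of an IEEE 754 binary32 value,
 * the representation every FPU or software-emulation routine manipulates. */
static void unpack_binary32(float f, uint32_t *sign, uint32_t *exp, uint32_t *mant)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret bits without violating aliasing rules */
    *sign = bits >> 31;               /* 1 bit  */
    *exp  = (bits >> 23) & 0xFFu;     /* 8 bits, biased by 127 */
    *mant = bits & 0x7FFFFFu;         /* 23 bits, implicit leading 1 for normal numbers */
}

int main(void)
{
    uint32_t s, e, m;
    unpack_binary32(-1.5f, &s, &e, &m);
    printf("sign=%u exp=%u mant=0x%06X\n", s, e, m);  /* prints: sign=1 exp=127 mant=0x400000 */
    return 0;
}
```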
On-chip Deep Neural Network (DNN) inference and training at the Extreme Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; a 16-bit floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultra-low-power quantized inference.
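To make concrete what "tiled matrix-multiplication acceleration" refers to, here is a plain-C software reference for the blocked access pattern such an engine implements in hardware; the tile size and the use of 32-bit float (rather than the TPE's 16-bit FP) are our own illustrative assumptions.

```c
#include <stddef.h>

#define TILE 4  /* illustrative tile size; the actual TPE tiling is hardware-defined */

/* Software reference for tiled matrix multiplication C += A * B on n x n
 * row-major matrices. Working on one TILE x TILE block at a time maximizes
 * data reuse, which is the point of tiling in both software and hardware. */
void matmul_tiled(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE)
                for (size_t i = i0; i < i0 + TILE && i < n; i++)
                    for (size_t j = j0; j < j0 + TILE && j < n; j++) {
                        float acc = C[i * n + j];
                        for (size_t k = k0; k < k0 + TILE && k < n; k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}
```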
Small, low-cost IoT devices rely on floating-point (FP) software emulation on 32-bit integer cores when the cost of a full-fledged FPU is not affordable. Thus, the performance and code size of the FP emulation library are decisive for meeting energy and memory-size constraints. We propose RVfplib, the first ISA-optimized open-source library for single- and double-precision IEEE 754 FP emulation on RV32IM[C] cores. RVfplib is 59% smaller and 2× faster than the GCC emulation library, on average. On benchmark programs, the code-size reduction is 39% and the performance boost is 1.5×. RVfplib is 5.3% smaller than the leading closed-source RISC-V commercial library.
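For context on what such a library replaces: when compiling for an RV32IM[C] target without an FPU, GCC lowers FP operators to calls into its soft-float runtime (routines such as __mulsf3 from libgcc), and an emulation library of this kind supplies those routines instead. The snippet and build flags below are an illustrative sketch, not taken from the paper.

```c
/* On an RV32IM[C] target without an FPU, GCC replaces FP operators with calls
 * into its soft-float runtime, e.g. this multiply becomes a call to __mulsf3.
 * Linking against an optimized emulation library changes the code size and
 * cycle count of exactly these calls.
 *
 * Illustrative build command:
 *   riscv32-unknown-elf-gcc -march=rv32imc -mabi=ilp32 -O2 -c scale.c
 */
float scale(float x, float gain)
{
    return x * gain;   /* compiled to a call to __mulsf3 on a soft-float target */
}
```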