Custom 8-bit floating point value format for reducing shared memory bank conflict in approximate nearest neighbor search

Ootomo, Hiroyuki; Naruse, Akira

doi:10.48550/arxiv.2301.06672

Cited by 1 publication

(1 citation statement)

References 0 publications

“…However, the machine could be synthesized using floating-point formats such as bfloat16, float16, and TensorFloat32 [44], by implementing the required logic in the PEs, and by adjusting, in general, memory size. It would be also possible to maintain memory capacity unvaried and store intermediate layer results using 8-bit floating point arithmetic, for instance e5m3 or e4m4 [45]. In particular, e5m3 can represent normalized float16 numbers with lower accuracy but enable a small overhead for float32 conversion.…”

Section: Single Perceptron Linear Vector Processor a High-level Archi... (mentioning)

confidence: 99%
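
The quoted statement says that e5m3 can hold normalized float16 numbers at lower accuracy with little conversion overhead. A minimal sketch of one way this could work is shown below. It rests on assumptions that are not confirmed by the source: that e5m3 means 5 exponent bits and 3 mantissa bits with no sign bit, that it shares float16's exponent bias, and that the low mantissa bits are truncated rather than rounded. The function names are hypothetical.

```python
import numpy as np

def float16_to_e5m3(x):
    """Pack float16 values into 8 bits (assumed layout: 5 exponent + 3 mantissa bits).

    float16 is laid out as 1 sign | 5 exponent | 10 mantissa bits.
    Shifting right by 7 drops the low 7 mantissa bits; masking with 0xFF
    drops the sign bit, so only non-negative values survive a round trip.
    """
    bits = np.atleast_1d(np.asarray(x, dtype=np.float16)).view(np.uint16)
    return ((bits >> 7) & 0xFF).astype(np.uint8)

def e5m3_to_float32(b):
    """Unpack the assumed e5m3 byte: shift back into float16 position, then widen."""
    bits = np.atleast_1d(np.asarray(b, dtype=np.uint8)).astype(np.uint16) << 7
    return bits.view(np.float16).astype(np.float32)

if __name__ == "__main__":
    values = np.array([1.0, 1.5, 3.14159, 100.0], dtype=np.float16)
    packed = float16_to_e5m3(values)
    restored = e5m3_to_float32(packed)
    for v, p, r in zip(values, packed, restored):
        print(f"float16 {float(v):10.5f} -> byte 0x{int(p):02X} -> float32 {float(r):10.5f}")
```

Under these assumptions the exponent field is copied unchanged, so the conversion in either direction is a single shift, and truncating the mantissa to 3 bits keeps the relative error of positive normalized values below 2^-3.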

An 8-bit Single Perceptron Processing Unit for Tiny Machine Learning Applications

Crepaldi, Salvo, Merello

2023

IEEE Access

We present a tiny MultiLayer Perceptron (MLP) accelerator named Single Perceptron Linear Vector Processor (SPLVP) that aims at extending the capabilities of resource-limited MCUs, enabling inference time speedup and main CPU off-load. It is based on a single perceptron hardware unit, enhanced with an additional accumulation input and scaling features, that is sequentially scheduled to cover all the nodes of the network. The accelerator supports both linear and Rectified Linear Unit (ReLU) activation and its firmware can be generated from 8-bit tflite quantized models. We also present a complete design toolchain that encompasses supervised learning, compilation, assembly, simulation, and device programming. The hardware support for extra accumulation input and scaling, together with the processor memory partitioning, are the key features that enable significant speedups. By solving a toy recognition problem based on image data captured from an infra-red camera, measurements show that the execution speed of SPLVP at 80 MHz outperforms an ARM Cortex-M4 STM32L476 microcontroller by a factor of 9.2 when the same ANN is translated to MCU code using the STM CubeMX-Ai converter at the same clock frequency. SPLVP is synthesized on a low-cost, low-gate-count Cyclone 10 LP FPGA resulting in an 18% logic and 77% memory occupation. The SPLVP assembly code can be directly converted into a VHDL description that directly hardcodes the ANN. The execution speed of an ANN model for Iris classification, fully synthesized, improves by a factor of 209 compared to firmware execution on the MCU. To verify the operation of SPLVP and its design framework, we have designed various tiny Machine Learning (ML) classifiers, for which we briefly discuss the obtained performance and the preprocessing techniques used. Across all these classifiers, the obtained speedup compared to the STM32 is 8.3-14.9×.
INDEX TERMS: Neural processing unit, multilayer perceptron, single perceptron linear vector processor, fully connected neural network, FPGA, compiler, design toolchain, MCU, tiny machine learning.
