2021
DOI: 10.3390/electronics10222859

FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit

Abstract: Convolutional neural networks (CNNs) are widely used in modern applications for their versatility and high classification accuracy. Field-programmable gate arrays (FPGAs) are considered to be suitable platforms for CNNs based on their high performance, rapid development, and reconfigurability. Although many studies have proposed methods for implementing high-performance CNN accelerators on FPGAs using optimized data types and algorithm transformations, accelerators can be optimized further by investigating mor…

Cited by 16 publications (6 citation statements)
References 24 publications (40 reference statements)
"…These include devices like the ARM Ethos NPU, BeagleBone AI, Intel Movidius NCS, NVIDIA Jetson Nano, and many others. These hardware accelerators are computationally efficient, but not optimized for power consumption [8]…"
Section: Review of Edge Computing (mentioning)
confidence: 99%
"…The authors in [26] proposed an accelerator for the LeNet-5 architecture to perform handwritten-digit classification. The proposed strategy is based on three major aspects: loop parallelization to utilize resources, fixed-point data optimization to find the minimum number of bits that maintains the accuracy level, and finally implementing approximate MAC units through logic blocks such as look-up tables (LUTs) and flip-flops (FFs) rather than using high-precision digital signal processors (DSPs)…"
Section: Related Work (mentioning)
confidence: 99%
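The strategy quoted above combines fixed-point quantization with approximate MAC units. As a rough illustration only (the paper's actual bit widths and approximation scheme are not given here), the following sketch models a signed 8-bit fixed-point MAC whose multiplier truncates low-order partial-product bits, in the spirit of a reduced-precision LUT-based unit; `FRAC_BITS`, `TRUNC_BITS`, and all function names are hypothetical choices, not values from the paper.

```python
# Hypothetical model of a truncation-based approximate fixed-point MAC.
# FRAC_BITS and TRUNC_BITS are illustrative assumptions, not from the paper.

FRAC_BITS = 4    # fractional bits in the assumed Q3.4 fixed-point format
TRUNC_BITS = 3   # low-order product bits discarded by the approximation

def to_fixed(x: float) -> int:
    """Quantize a real value to a signed 8-bit fixed-point integer."""
    return max(-128, min(127, round(x * (1 << FRAC_BITS))))

def approx_mac(acc: int, a: int, b: int) -> int:
    """Multiply-accumulate that zeroes the low-order product bits,
    modeling the error introduced by a reduced-precision multiplier."""
    product = a * b
    product = (product >> TRUNC_BITS) << TRUNC_BITS  # truncate low bits
    return acc + product

# Dot product of two small vectors, exact vs. approximate
xs = [to_fixed(v) for v in (0.3, -0.25, 0.7)]
ws = [to_fixed(v) for v in (0.1, 0.5, -0.5)]

exact = sum(a * b for a, b in zip(xs, ws))
acc = 0
for a, b in zip(xs, ws):
    acc = approx_mac(acc, a, b)

print(exact, acc)  # the approximate result deviates slightly from the exact one
```

The appeal of such a unit on an FPGA is that a truncated multiplier needs fewer LUTs and FFs than a full-width one, freeing scarce DSP blocks, at the cost of a small, bounded accumulation error that the network's accuracy can often absorb.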
"…Moreover, Table 6 summarizes the comparison with state-of-the-art ANN training accelerators for MNIST classification [40][41][42]. Small networks are selected for the ANNs to present a better comparison with our work…"
Section: F. Fractional Precision (mentioning)
confidence: 99%