Efficient Implementation of the Backpropagation Algorithm in FPGAs and Microcontrollers

Ortega-Zamorano, Francisco; Jerez, José M.; Munoz, Daniel; Luque‐Baena, Rafael Marcos; Franco, Leonardo

doi:10.1109/tnnls.2015.2460991

Cited by 69 publications

(49 citation statements)

References 34 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…It can be appreciated from the error curves that for the FPGA implementation case some larger oscillations appear, and this is due to rounding effects because of the size of the fixed point representation used. In terms of the level of prediction accuracy obtained these oscillations do not degrade it, and on the contrary in some cases even leads to larger values, as it has been observed previously in FPGA implementations [31,33], and in several works where it was concluded that certain level of noise might be beneficial for improving learning times, fault tolerance and prediction accuracy [1,15,26]. Table 3 shows the generalization ability obtained for several architectures with different numbers of hidden layers for the FPGA and MC implementations.…”

Section: Resultssupporting

confidence: 76%

“…The first column indicates the number of hidden layer present in the architecture, the second column shows the generalization obtained using the MC implementation (mean and standard deviation computed over 100 independent runs using C code), while third and fourth columns shows the results for two different FPGA implementations: the layer multiplexing scheme proposed in this work and the fixed layer scheme utilized in Ref. [31] (only available for architectures with one and two hidden layers). The number of neurons in each of the hidden layers was fixed to five and the number of epochs set to 1000.…”

Section: Resultsmentioning

confidence: 99%

“…The first three columns indicate the data set name, number of inputs and outputs respectively, while the last two columns shows the generalization ability obtained using neural network architectures with 5 neurons in the single hidden layer. This choice of number of neurons permits the comparison with published results [31]. For carrying out the simulations a training, validation and test sets splitting was used in a 50-20-30% scheme; in which the validation set was used to find the number of epochs for evaluating the test error, the maximum number of epochs was set to 1000, and the learning rate was equal to 0.2.…”

Section: Resultsmentioning

confidence: 99%

“…On the other hand on-chip learning implementations includes both training and execution phases of the algorithm [6,32,41] permitting the whole process to be carried out in the FPGA board independently of an external device. Existing specific implementations of the artificial neural network Back-Propagation algorithm in FPGA boards include the works of [9,29,31,39]. In all of these works, the neural network architecture is previously prefixed by the designer, as the number of neurons and hidden layers is limited by the FPGA resources available.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Layer multiplexing FPGA implementation for deep back-propagation learning

Ortega-Zamorano

Jerez

Gómez

et al. 2017

ICA

Self Cite

View full text Add to dashboard Cite

Training of large scale neural networks, like those used nowadays in Deep Learning schemes, requires long computational times or the use of high performance computation solutions like those based on cluster computation, GPU boards, etc. As a possible alternative, in this work the Back-Propagation learning algorithm is implemented in an FPGA board using a multiplexing layer scheme, in which a single layer of neurons is physically implemented in parallel but can be reused any number of times in order to simulate multi-layer architectures. An on-chip implementation of the algorithm is carried out using a training/validation scheme in order to avoid overfitting effects. The hardware implementation is tested on several configurations, permitting to simulate architectures comprising up to 127 hidden layers with a maximum number of neurons in each layer of 60 neurons. We confirmed the correct implementation of the algorithm and compared the computational times against C and Matlab code executed in a multicore supercomputer, observing a clear advantage of the proposed FPGA scheme. The layer multiplexing scheme used provides a simple and flexible approach in comparison to standard implementations of the Back-Propagation algorithm representing an important step towards the FPGA implementation of deep neural networks, one of the most novel and successful existing models for prediction problems.

show abstract

Section: Resultssupporting

confidence: 76%

Section: Resultsmentioning

confidence: 99%

Section: Resultsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Layer multiplexing FPGA implementation for deep back-propagation learning

Ortega-Zamorano

Jerez

Gómez

et al. 2017

ICA

Self Cite

View full text Add to dashboard Cite

show abstract

“…The key operations for the training and inference processes of DNNs are the vector‐matrix product, nonlinear function execution, and weights matrix update, while SNNs require spiking neurons and synaptic devices. To accelerate DNNs, various computing systems have been designed as DL accelerator (DLA), for instance, the FPGA based platforms, ASIC based TPU, DianNao, etc . These DLAs use a novel computing architecture to expedite training or the inference process of DNNs.…”

Section: Introductionmentioning

confidence: 99%

Neuromorphic Computing with Memristor Crossbar

Zhang

Huang

et al. 2018

Physica Status Solidi (a)

View full text Add to dashboard Cite

Neural networks, one of the key artificial intelligence technologies today, have the computational power and learning ability similar to the brain. However, implementation of neural networks based on the CMOS von Neumann computing systems suffers from the communication bottleneck restricted by the bus bandwidth and memory wall resulting from CMOS downscaling. Consequently, applications based on large-scale neural networks are energy/ area hungry and neuromorphic computing systems are proposed for efficient implementation of neural networks. Neuromorphic computing system consists of the synaptic device, neuronal circuit, and neuromorphic architecture. With the two-terminal nonvolatile nanoscale memristor as the synaptic device and crossbar as parallel architecture, memristor crossbars are proposed as a promising candidate for neuromorphic computing. Herein, neuromorphic computing systems with memristor crossbars are reviewed. The feasibility and applicability of memristor crossbars based neuromorphic computing for the implementation of artificial neural networks and spiking neural networks are discussed and the prospects and challenges are also described.

show abstract

Deep reinforcement learning‐based autonomous parking design with neural network compute accelerators

Özeloğlu

Gürbüz

San

2021

Concurrency and Computation

View full text Add to dashboard Cite

We describe the design and implementation of an autonomous prototype vehicle which finds an empty parking slot in a parking area, and parks itself in the empty parking slot, using neural networks based on deep reinforcement learning (RL). To perform an autonomous parking procedure for our prototype vehicle, two different artificial neural networks (ANNs) are trained using a deep RL Algorithm in a simulation environment and embedded into the computing platform of the prototype car.One of the ANNs enables the vehicle to drive autonomously in the parking environment. At the same time, an image processing algorithm is used to determine whether a parking slot is empty. When the image processing algorithm finds a suitable parking slot, a different ANN is activated and performs a safe parking procedure. However, ANN-based machine learning techniques require high processing power and impose a high computational burden on embedded CPU and GPU platforms. To alleviate the computational burden, one can achieve higher performance and less power consumption using an application-specific hardware design, where logic resources are fully exploited according to the algorithm of interest, in an energy-efficient manner. In this article, hardware accelerators for our ANN models are designed and generated via the Vivado high-level synthesis (HLS) tool, targeting an ARM based programmable SoC platform, ZedBoard. Our ANN accelerators have achieved a speedup of 17x as compared to an ARM software implementation. For deeper fully-connected layers used in deep RL-based solutions, function-level parallelism (Vivado's dataflow) is employed to improve the computational efficiency. Our proposed stage-level description for fully connected layers outperforms recent studies in terms of computation time.

show abstract

Efficient Implementation of the Backpropagation Algorithm in FPGAs and Microcontrollers

Cited by 69 publications

References 34 publications

Layer multiplexing FPGA implementation for deep back-propagation learning

Layer multiplexing FPGA implementation for deep back-propagation learning

Neuromorphic Computing with Memristor Crossbar

Deep reinforcement learning‐based autonomous parking design with neural network compute accelerators

Contact Info

Product

Resources

About