2019 International Conference on Field-Programmable Technology (ICFPT)
DOI: 10.1109/icfpt47387.2019.00009

Training Deep Neural Networks in Low-Precision with High Accuracy Using FPGAs

Cited by 23 publications (26 citation statements) · References 10 publications
“…Besides, even with cloud-level resources, reduced-precision and pruning approaches have also been utilized to decrease computation intensity and the communication bottleneck. Although the quantization adopted in prior training accelerators [4,18] led to remarkable benefits in terms of resource usage and power consumption, these works have not provided any evidence that such quantization techniques can maintain high accuracy on a large dataset (e.g., ImageNet) with dense neural networks.…”
Section: Related Work
confidence: 99%
“…), the on-chip memory of an edge FPGA is not big enough to hold the weights or features of every Conv layer. Therefore, several works [4,18,20] applied quantization or pruning to reduce off-chip memory access. However, unlike inference, where compressed networks cause little accuracy loss [7], these training works have not shown that their compression techniques can maintain high accuracy on large datasets with dense networks.…”
Section: And
confidence: 99%
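The compression techniques this excerpt refers to, quantization and pruning, can be illustrated with a minimal NumPy sketch (my own illustration, not taken from the accelerators in [4,18,20]): magnitude pruning zeroes out small weights, and symmetric 8-bit quantization shrinks the survivors, reducing the off-chip traffic a Conv layer's weights would otherwise require. The tensor shape, pruning ratio, and sparse storage format are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)   # hypothetical Conv-layer weights

# Magnitude pruning: zero out the 80% smallest-magnitude weights (assumed ratio).
threshold = np.quantile(np.abs(w), 0.8)
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0).astype(np.float32)

# Symmetric 8-bit quantization of the surviving weights.
scale = np.abs(w_pruned).max() / 127.0
w_int8 = np.clip(np.round(w_pruned / scale), -127, 127).astype(np.int8)

dense_bytes = w.nbytes                                   # fp32, uncompressed
int8_bytes = w_int8.nbytes                               # quantized only
sparse_bytes = int(np.count_nonzero(w_int8)) * (1 + 2)   # int8 value + assumed 16-bit index
print(f"fp32: {dense_bytes} B, int8: {int8_bytes} B, pruned+int8: {sparse_bytes} B")
```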
“…This is not critical anyway in most machine learning applications (e.g., ANN), where a relatively reduced set of output categories or classes must be discriminated based on generic similarities. Indeed, most of the current machine learning applications use a small number of bits to represent digitized signals [41] because the difference in the final result between using high-precision floating-point signals and low-precision 8/16-bit signals is negligible [42]. Hence, the integration time used to evaluate the result of stochastic operations can be considerably reduced since it is exponentially dependent on the bit precision.…”
Section: Artificial Neural Network Applied To Virtual Screening
confidence: 99%
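The negligible-difference claim in this excerpt can be sanity-checked with a toy experiment (my own sketch, not taken from the cited references [41,42]): quantize a hypothetical feature vector and classifier weights to 8 bits and check whether the predicted class changes. All names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(512).astype(np.float32)          # hypothetical feature vector
W = rng.standard_normal((10, 512)).astype(np.float32)    # hypothetical 10-class classifier

def fake_quantize(a, bits=8):
    """Uniform symmetric quantization to `bits`, returned de-quantized for comparison."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a).max() / qmax
    return np.round(a / scale) * scale

logits_fp = W @ x
logits_q = fake_quantize(W) @ fake_quantize(x)

print("fp32 prediction:", int(np.argmax(logits_fp)),
      "| 8-bit prediction:", int(np.argmax(logits_q)))
print("largest logit deviation:", float(np.abs(logits_fp - logits_q).max()))
```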
“…Such quantization greatly reduces the model size and computational complexity, making it suitable for hardware implementation. S. Fox et al. [32] implemented a training accelerator based on 8-bit integer operations. It processes the forward and backward computations on the FPGA with 8-bit integers, while the weight-update computation is processed in full precision on an ARM processor.…”
Section: Related Work
confidence: 99%
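A rough sketch of the hardware/software split this excerpt describes, under my own assumptions about shapes and scaling: the forward and weight-gradient GEMMs use 8-bit integers with int32 accumulation (standing in for the FPGA datapath), while a separate fp32 master copy of the weights receives the SGD update (standing in for the ARM side). Function and variable names are illustrative, not from [32].

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize_int8(a):
    """Symmetric per-tensor int8 quantization; returns integer values and the scale."""
    scale = np.abs(a).max() / 127.0 + 1e-12
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

# Master weights live in full precision.
w_fp32 = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal((16, 8)).astype(np.float32)         # toy activations
grad_out = rng.standard_normal((16, 4)).astype(np.float32)  # made-up upstream gradient

# "FPGA side": int8 forward and backward GEMMs, accumulated in int32.
xq, sx = quantize_int8(x)
wq, sw = quantize_int8(w_fp32)
gq, sg = quantize_int8(grad_out)
y = (xq.astype(np.int32) @ wq.astype(np.int32).T) * (sx * sw)       # forward output
grad_w = (gq.astype(np.int32).T @ xq.astype(np.int32)) * (sg * sx)  # weight gradient

# "ARM side": full-precision SGD update on the master copy.
lr = 1e-2
w_fp32 -= lr * grad_w.astype(np.float32)
print("updated fp32 master weights (first row):", w_fp32[0, :4])
```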
“…L. Yang et al. [33] implemented a binarized neural network (BNN) on FPGA, replacing the original binary convolution layer with two parallel binary convolutional layers for fast inference. These previous studies [22]–[33] follow the sequential processing order of the gradient descent algorithm.…”
Section: Related Work
confidence: 99%
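For context, a generic sketch of the binary convolution that BNN layers build on, assuming the usual sign (+1/-1) binarization; it does not reproduce the specific two-parallel-layer structure of [33]. On hardware the elementwise product of two {-1,+1} operands reduces to an XNOR and the accumulation to a popcount; here it is written with ordinary floating-point arithmetic for clarity.

```python
import numpy as np

rng = np.random.default_rng(3)

def binarize(a):
    """Deterministic sign binarization to {-1, +1}."""
    return np.where(a >= 0, 1.0, -1.0).astype(np.float32)

x = rng.standard_normal((1, 3, 8, 8)).astype(np.float32)   # toy NCHW input
w = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)   # 4 output channels, 3x3 kernels

xb, wb = binarize(x), binarize(w)

# Direct binary convolution (stride 1, no padding).
n, c, h, win = xb.shape
k, _, kh, kw = wb.shape
out = np.zeros((n, k, h - kh + 1, win - kw + 1), dtype=np.float32)
for oc in range(k):
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            out[0, oc, i, j] = np.sum(xb[0, :, i:i+kh, j:j+kw] * wb[oc])

print("binary conv output shape:", out.shape)   # (1, 4, 6, 6)
```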