2020 30th International Conference on Field-Programmable Logic and Applications (FPL)
DOI: 10.1109/fpl50879.2020.00055

LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications

Abstract: Deployment of deep neural networks for applications that require very high throughput or extremely low latency is a severe computational challenge, further exacerbated by inefficiencies in mapping the computation to hardware. We present a novel method for designing neural network topologies that directly map to a highly efficient FPGA implementation. By exploiting the equivalence of artificial neurons with quantized inputs/outputs and truth tables, we can train quantized neural networks that can be directly co…
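
To make the truth-table equivalence mentioned in the abstract concrete, the sketch below (ours, not the authors' code; the bit widths, weights, and sigmoid activation are purely illustrative) enumerates every quantized input combination of a single small neuron and records the quantized output, yielding a finite table that could be synthesized directly as a LUT.

    # Minimal sketch (not the paper's code): enumerating the truth table of one
    # neuron whose inputs and output are quantized to a few bits, which is the
    # equivalence LogicNets exploits to map neurons onto LUTs.
    import itertools
    import numpy as np

    IN_BITS, OUT_BITS, FAN_IN = 2, 2, 3          # illustrative, deliberately tiny

    rng = np.random.default_rng(0)
    weights = rng.normal(size=FAN_IN)            # stand-in for a trained neuron's weights
    bias = 0.1

    def quantize(x, bits):
        """Uniformly quantize x in [0, 1) to an integer code with `bits` bits."""
        levels = 2 ** bits
        return int(np.clip(np.floor(x * levels), 0, levels - 1))

    truth_table = {}
    for codes in itertools.product(range(2 ** IN_BITS), repeat=FAN_IN):
        # Dequantize the input codes to [0, 1), apply the neuron, re-quantize.
        x = np.array(codes) / (2 ** IN_BITS)
        act = 1.0 / (1.0 + np.exp(-(weights @ x + bias)))   # sigmoid keeps act in (0, 1)
        truth_table[codes] = quantize(act, OUT_BITS)

    # This finite table is all the hardware needs: it can be realized as a
    # (FAN_IN * IN_BITS)-input, OUT_BITS-output lookup table.
    for codes, out in truth_table.items():
        print(codes, "->", out)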

Cited by 65 publications (30 citation statements) | References 15 publications

“…The same group have also developed a library for quantization-aware training, Brevitas [46], based on PyTorch model formats. The LogicNets design flow [47], also from Xilinx Research Labs, allows for the training of quantized DNNs that map to highly efficient Xilinx FPGA implementations. A comparison between the approach presented here and LogicNets is provided in Section VII.…”
Section: Related Work (mentioning)
Confidence: 99%
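
Brevitas itself exposes quantized layer types on top of PyTorch; the fragment below is only a hypothetical illustration of the underlying quantization-aware-training idea (a fake quantizer with a straight-through estimator), written in plain PyTorch rather than Brevitas's actual API.

    # Hypothetical sketch of quantization-aware training (not Brevitas's API):
    # a straight-through-estimator "fake quantizer" follows each activation, so
    # the forward pass sees quantized values while gradients pass through unchanged.
    import torch
    import torch.nn as nn

    class FakeQuant(nn.Module):
        def __init__(self, bits: int = 4):
            super().__init__()
            self.levels = 2 ** bits - 1

        def forward(self, x):
            x = torch.clamp(x, 0.0, 1.0)                    # restrict to [0, 1]
            q = torch.round(x * self.levels) / self.levels  # snap to 2^bits levels
            return x + (q - x).detach()                     # straight-through estimator

    model = nn.Sequential(
        nn.Linear(16, 32), nn.ReLU(), FakeQuant(bits=4),
        nn.Linear(32, 4),
    )
    out = model(torch.rand(8, 16))  # trains with ordinary optimizers and losses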
“…Further, we compare the results obtained using the QKeras and hls4ml workflow to LogicNets [47], another work on extreme low-latency, low-resource, fully-unfolded (II=1) FPGA implementations. The metrics are those quoted in Table III.…”
Section: Ultra Low-Latency Quantized Model on FPGA Hardware (mentioning)
Confidence: 99%
“…Since each unit requires only one BRAM access, it is faster than FINN-R. It also requires lower power, since our network requires fewer BRAMs than FINN-R. Note that this comparison excludes the softmax part [20].…”
Section: Comparison with Neural Network (mentioning)
Confidence: 99%
“…With LUTNet, we reported area efficiency improvements of around 2× over ReBNet [3], the state-of-the-art BNN at the time, for problems of widely varying scale. More recent tools, including NullaNet [12] and LogicNets [18], also generate small LUTs as core components, but LUTNet remains unique in directly exposing a netlist's LUTs as differentiable functions trainable via stochastic gradient descent.…”
Section: Introduction (mentioning)
Confidence: 99%
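
The statement above hinges on treating a LUT as a differentiable function; the sketch below is an assumed, simplified rendering of that idea (not LUTNet's released code): the 2^K truth-table entries become trainable parameters, and the output is a multilinear interpolation over soft inputs in [0, 1], so gradients reach every table entry.

    # Conceptual sketch (assumption, not LUTNet's implementation) of a K-input LUT
    # as a differentiable function trainable by stochastic gradient descent.
    import itertools
    import torch
    import torch.nn as nn

    class SoftLUT(nn.Module):
        def __init__(self, k: int = 4):
            super().__init__()
            self.k = k
            self.table = nn.Parameter(torch.randn(2 ** k))  # one value per input pattern

        def forward(self, x):              # x: (batch, k), entries in [0, 1]
            out = torch.zeros(x.shape[0], device=x.device)
            for idx, bits in enumerate(itertools.product([0, 1], repeat=self.k)):
                # Probability-style weight of this truth-table row given soft inputs.
                w = torch.ones(x.shape[0], device=x.device)
                for j, b in enumerate(bits):
                    w = w * (x[:, j] if b else (1.0 - x[:, j]))
                out = out + w * self.table[idx]
            return out

    lut = SoftLUT(k=4)
    y = lut(torch.rand(8, 4))
    y.sum().backward()                     # gradients flow into lut.table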
“…LUT-based DNN inference accelerators have been shown to achieve remarkable performance when deployed on FPGAs. NullaNet [12] and LogicNets [18] were conceived with small-scale classification tasks in mind, for which they reached latency in the tens of nanoseconds and throughput in the hundreds of millions of samples per second. Going beyond FPGA-tailored network design, our previously proposed LUTNet topologies can be trained via stochastic gradient descent [20].…”
Section: Related Work, FPGA-Tailored DNN Architectures (mentioning)
Confidence: 99%