We present the implementation of binary and ternary neural networks in the hls4ml library, designed to automatically convert deep neural network models to digital circuits with FPGA firmware. Starting from benchmark models trained with floating point precision, we investigate different strategies to reduce the network's resource consumption by reducing the numerical precision of the network parameters to binary or ternary. We discuss the trade-off between model accuracy and resource consumption. In addition, we show how to balance between latency and accuracy by retaining full precision on a selected subset of network components. As an example, we consider two multiclass classification tasks: handwritten digit recognition with the MNIST data set and jet identification with simulated proton-proton collisions at the CERN Large Hadron Collider. The binary and ternary implementation has similar performance to the higher precision implementation while using drastically fewer FPGA resources.
We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks with convolutional layers on field-programmable gate arrays (FPGAs). By extending the hls4ml library, we demonstrate an inference latency of 5 µs using convolutional architectures, targeting microsecond latency applications like those at the CERN Large Hadron Collider. Considering benchmark models trained on the Street View House Numbers Dataset, we demonstrate various methods for model compression in order to fit the computational constraints of a typical FPGA device used in trigger and data acquisition systems of particle detectors. In particular, we discuss pruning and quantization-aware training, and demonstrate how resource utilization can be significantly reduced with little to no loss in model accuracy. We show that the FPGA critical resource consumption can be reduced by 97% with zero loss in model accuracy, and by 99% when tolerating a 6% accuracy degradation.
We describe the implementation of Boosted Decision Trees in the hls4ml library, which allows the translation of a trained model into FPGA firmware through an automated conversion process. Thanks to its fully on-chip implementation, hls4ml performs inference of Boosted Decision Tree models with extremely low latency. With a typical latency less than 100 ns, this solution is suitable for FPGA-based real-time processing, such as in the Level-1 Trigger system of a collider experiment. These developments open up prospects for physicists to deploy BDTs in FPGAs for identifying the origin of jets, better reconstructing the energies of muons, and enabling better selection of rare signal processes.
In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an areaefficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information. We implemented a prototype tool, called MNEMOSYNE, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently-released benchmark suites (PERFECT and CORTEXSUITE). With our approach we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings that range between 17% and 55% compared to the case where the PLMs are designed separately. Index Terms-Hardware accelerator, high-level synthesis (HLS), memory design, multibank architecture. I. INTRODUCTION S YSTEM-ON-CHIP (SoC) architectures increasingly feature hardware accelerators to achieve energy-efficient high performance [1]. Complex applications leverage these specialized components to improve the execution of selected computational kernels [2], [3]. For example, hardware accelerators for machine learning applications are increasingly used to identify underlying relations in massive unstructured data [4]-[6]. Many of these algorithms first build an internal model by analyzing very large data sets; then, they leverage Manuscript
Graph neural networks have been shown to achieve excellent performance for several crucial tasks in particle physics, such as charged particle tracking, jet tagging, and clustering. An important domain for the application of these networks is the FGPA-based first layer of real-time data filtering at the CERN Large Hadron Collider, which has strict latency and resource constraints. We discuss how to design distance-weighted graph networks that can be executed with a latency of less than one μs on an FPGA. To do so, we consider a representative task associated to particle reconstruction and identification in a next-generation calorimeter operating at a particle collider. We use a graph network architecture developed for such purposes, and apply additional simplifications to match the computing constraints of Level-1 trigger systems, including weight quantization. Using the hls4ml library, we convert the compressed models into firmware to be implemented on an FPGA. Performance of the synthesized models is presented both in terms of inference accuracy and resource usage.
Emerging technologies provide SoCs with fine-grained DVFS capabilities both in space (number of domains) and time (transients in the order of tens of nanoseconds). Analyzing these systems requires cycle-accurate accounting of rapidly-changing dynamics and complex interactions among accelerators, interconnect, memory, and OS. We present an FPGA-based infrastructure that facilitates such analyses for high-performance embedded systems. We show how our infrastructure can be used to first generate SoCs with looselycoupled accelerators, and then perform design-space exploration considering several DVFS policies under full-system workload scenarios, sweeping spatial and temporal domain granularity.
ESP is an open-source research platform for heterogeneous SoC design. The platform combines a modular tile-based architecture with a variety of application-oriented flows for the design and optimization of accelerators. The ESP architecture is highly scalable and strikes a balance between regularity and specialization. The companion methodology raises the level of abstraction to system-level design and enables an automated flow from software and hardware development to full-system prototyping on FPGA. For application developers, ESP offers domain-specific automated solutions to synthesize new accelerators for their software and to map complex workloads onto the SoC architecture. For hardware engineers, ESP offers automated solutions to integrate their accelerator designs into the complete SoC. Conceived as a heterogeneous integration platform and tested through years of teaching at Columbia University, ESP supports the open-source hardware community by providing a flexible platform for agile SoC development. CCS CONCEPTS• Computer systems organization → System on a chip; • Hardware → Methodologies for EDA.
This paper presents the custom implementation, optimization, and performance evaluation of convolutional neural networks on field programmable gate arrays, for the purposes of accelerating deep neural network inference on large, twodimensional image inputs. The targeted application is that of data selection for high-resolution particle imaging detectors, and in particular liquid argon time projection chamber detectors, such as that employed by the future Deep Underground Neutrino Experiment. We motivate this particular application based on the excellent performance of deep neural networks on classifying simulated raw data from the DUNE LArTPC, combined with the need for power-efficient data processing in the case of remote, long-term, and limited-access operating detector conditions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.