A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform

Moss, D.; Krishnan, Srivatsan; Nurvitadhi, Eriko; Ratuszniak, P.; Johnson, Constance A.; Sim, Jaewoong; Mishra, Asit K.; Marr, Debbie; Subhaschandra, Suchit; Leong, Philip H. W.

doi:10.1145/3174243.3174258

Cited by 65 publications

(41 citation statements)

References 16 publications


“…Shi CNN was shown to obtain 4.2× and 3.8× energy efficiency savings over two baseline CNN platforms using DSP- and LUT-based bit-parallel MACs, respectively. Moss et al. presented an FPGA-based customisable matrix multiplication framework dedicated to DNN inference [100]. Their implementation allows for the runtime switching between static-precision bit-parallel and dynamic-precision bit-serial MAC implementations.…”

Section: Fixed-point Representation (mentioning)

confidence: 99%

Deep Neural Network Approximation for Custom Hardware

et al. 2019


Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.


“…described in [3]. Other FPGA architectures have been implemented to utilize the highly amenable nature of CNNs which constrain weight parameters to be only binary or ternary representations [29], [30]. With restrictions in the efficiency of both software and hardware implementations of neural networks, software-hardware codesign is considered an effective approach to achieve optimal performance [31], [32].…”

Section: Related Work (mentioning)

confidence: 99%

AddNet: Deep Neural Networks Using FPGA-Optimized Multipliers

Faraone, Kumm, Hardieck et al. 2020

IEEE Trans. VLSI Syst.


Low-precision arithmetic operations to accelerate deep-learning applications on field-programmable gate arrays (FPGAs) have been studied extensively, because they offer the potential to save silicon area or increase throughput. However, these benefits come at the cost of a decrease in accuracy. In this article, we demonstrate that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving the silicon area than utilizing low-precision arithmetic. RCCMs multiply input values by a restricted choice of coefficients using only adders, subtractors, bit shifts, and multiplexers (MUXes), meaning that they can be heavily optimized for FPGAs. We propose a family of RCCMs tailored to FPGA logic elements to ensure their efficient utilization. To minimize information loss from quantization, we then develop novel training techniques that map the possible coefficient representations of the RCCMs to neural network weight parameter distributions. This enables the usage of the RCCMs in hardware, while maintaining high accuracy. We demonstrate the benefits of these techniques using AlexNet, ResNet-18, and ResNet-50 networks. The resulting implementations achieve up to 50% resource savings over traditional 8-bit quantized networks, translating to significant speedups and power savings. Our RCCM with the lowest resource requirements exceeds 6-bit fixed point accuracy, while all other implementations with RCCMs achieve at least similar accuracy to an 8-bit uniformly quantized design, while achieving significant resource savings. Index Terms: Digital arithmetic, field programmable gate arrays (FPGAs), neural networks, neural network hardware, quantization.


“…Many existing frameworks [9], [25], [31], [23], [33], [39], that map CNN models to FPGAs generate a large homogeneous processing core that is temporally shared among layers. This common design is flexible, as by sequentially carrying out convolutions, it is less constrained by the amount of resources available on FPGAs.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Generation of Multi-Precision Multi-Arithmetic CNN Accelerators for FPGAs

Zhao, Gao, Guo et al. 2019

2019 International Conference on Field-Programmable Technology (ICFPT)


Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic autogeneration framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits in FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities, and permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories for the first time. Furthermore, we show how Tomato produces implementations of networks with various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform closest FPGA-based competitors by at least 2-4× for latency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second.
