Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
DOI: 10.1145/3020078.3021744

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Abstract: Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present Finn, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, conv…
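With binary weights and activations, the multiply-accumulate at the heart of each layer reduces to bitwise operations, which is what makes the hardware mapping mentioned in the abstract efficient. The sketch below illustrates the standard XNOR-popcount formulation of a binarized dot product used by accelerators of this kind; the NumPy code, helper names, and vector length are illustrative assumptions, not details of the FINN implementation.

import numpy as np

def binarize(x):
    # Map real values to {-1, +1} by sign (zero is treated as +1).
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_dot(w_bits, a_bits):
    # Dot product of two {-1, +1} vectors computed the way a binarized
    # accelerator would: XNOR the 0/1 bit encodings, count the agreements
    # (popcount), then rescale: dot = 2 * popcount - N.
    n = len(w_bits)
    agreements = int((~(w_bits ^ a_bits) & 1).sum())
    return 2 * agreements - n

# Cross-check against an ordinary dot product on the {-1, +1} values.
rng = np.random.default_rng(0)
w = binarize(rng.standard_normal(64))
a = binarize(rng.standard_normal(64))
w_bits = ((w + 1) // 2).astype(np.uint8)   # -1 -> 0, +1 -> 1
a_bits = ((a + 1) // 2).astype(np.uint8)
assert xnor_popcount_dot(w_bits, a_bits) == int(np.dot(w, a))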

Cited by 714 publications (123 citation statements)
References 22 publications
“…The hardware metrics of the residual network accelerator (RNA) are shown in Table 3, while the details of resource utilization for each module are shown in [14] and [24]. Our accelerator with 4-bit logarithmic parameters outperforms the 16-bit fixed-point accelerators in terms of performance and accuracy, with a 3−7× speedup and a 1%−4% accuracy gain, proving that our accelerator can effectively accelerate residual networks. Some high-throughput accelerators from [22] and [16], performing binarized neural network inference, outperform our accelerator in performance but cause a significant accuracy reduction. Moreover, our accelerator can achieve [23], [25].…”
Section: Performance of RNA Implementation
confidence: 97%
“…The benefits, in terms of power and throughput, of fitting a design on-chip were described in [3]. [Figure: Example of a reconfigurable multiplier with the coefficient set {12305, 20746}.] Other FPGA architectures have been implemented to utilize the highly amenable nature of CNNs, which constrain weight parameters to only binary or ternary representations [29], [30].…”
Section: Related Work
confidence: 99%
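As a point of reference for the binary/ternary weight constraint mentioned in the excerpt above, here is a minimal sketch of ternary weight quantization; the function name and threshold value are illustrative assumptions, not details taken from [29] or [30].

import numpy as np

def ternarize(w, threshold=0.05):
    # Constrain weights to {-1, 0, +1}: zero out small magnitudes,
    # keep only the sign of the rest. The threshold is an arbitrary
    # example value, not one from the cited works.
    q = np.sign(w)
    q[np.abs(w) < threshold] = 0
    return q.astype(np.int8)

w = np.array([0.42, -0.03, -0.76, 0.01, 0.10])
print(ternarize(w))   # [ 1  0 -1  0  1]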
“…m itself is bounded by its arbitrary unsigned 2's complement fixed-point representation, where f is the number of fractional bits, and hence m = 2^(k−f) − 2^(−f). A summary of the training process is given in Algorithm 1, which is similar to [3] and [6], with the addition of distribution matching and incorporating the quantization scheme of (5).…”
Section: Algorithm 1: Training a CNN Using AddNet Representations
confidence: 99%
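As a quick numeric check of the bound quoted above, the snippet below evaluates m = 2^(k−f) − 2^(−f), the largest value an unsigned fixed-point word with k total bits and f fractional bits can hold; k = 8 and f = 4 are arbitrary example values, not parameters from the cited work.

# Largest value representable by an unsigned fixed-point word with
# k total bits and f fractional bits (all bits set):
#   m = 2**(k - f) - 2**(-f)
k, f = 8, 4                                  # illustrative word length only
m = 2 ** (k - f) - 2 ** (-f)
print(m)                                     # 15.9375
assert m == (2 ** k - 1) / 2 ** f            # same bound written as (2^k - 1) * 2^(-f)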